EMPHASIZING THE IMPORTANCE OF DATA AND EVALUATION IN THE ERA OF LARGE LANGUAGE MODELS

by

JIAO SUN

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2024

Copyright 2024 JIAO SUN

Dedication

To a life-changing journey with support from family and friends.

Acknowledgements

I never thought I would do a Ph.D. in the States. Looking back at these extremely exciting but challenging years, I am glad that I made that decision and stepped out of my comfort zone. There were many ups and downs throughout my Ph.D. journey, and I would not have been able to make it without my family, friends and advisors.

First, I would like to thank my advisor Xuezhe (Max) Ma for his constant support and trust. Max became my advisor during the second year of my Ph.D. study. It was a difficult time for me, as I did not have an advisor at USC. Max saved me from that miserable advisorless status and has supported every decision I have made since. He not only lets me freely explore research directions, but also cares for my mental health and well-being. I am grateful to have him as my advisor.

I also thank Nanyun (Violet) Peng for her mentorship and tireless support. She admitted me as a research assistant at PlusLab when I first started my journey in NLP. I would never be who I am today without her. She is super hardworking, organized, and has wonderful research taste. I have learned a lot from her since Day 1. She has always been kind to our lab members and has always tried her best to help us. I still remember the laughter and joy when PlusLab members celebrated Thanksgiving together.

Besides Max and Violet, my other committee members, Jon May, Emilio Ferrara and Dan O'Leary, have always been so kind and supportive to me. With them, those important milestones in my Ph.D. were no longer exams but opportunities to hear constructive feedback and communicate with the best minds. I deeply appreciate their commitment and wish to continue collaborating with all of them even after I graduate.

I would like to thank Prof. Cyrus Shahabi for admitting me to USC and advising me on my first projects around deep learning for social good, where I learned a lot. I would like to thank Prof. Swabha Swayamdipta for working with me for over a year on rationales. Her careful and rigorous research style has had a great influence on my research. As a member of both MaxLab and PlusLab, I thank I-Hung Hsu, Rujun Han, Hou (Hope) Yu, Derek Ma, Steeve Huang, Yufei Tian and Arshiya Aggarwal for working with me, and other awesome labmates for providing feedback on both my paper drafts and presentations.

Besides school, I interned at Google, Amazon and IBM. I thank Dr. Sebastian Gehrmann for hosting me at Google and making it my best internship experience. He helped me technically, gave me constructive feedback for conducting influential research, and also actively connected me with researchers he knew. More preciously, he has kept introducing me to people even after I finished my internship, which greatly helped my career. I also thank Jacob Eisenstein, Elizabeth Clark, Tu Vu, Thibault Sellam, Timothy Dozat, Dan Garrette, Aditya Siddhant, Chunting Zhou, Omer Levy, Anjali Narayan-Chen, Shereen Oraby, Shuyang Gao, Jing Huang, Yang Liu, Vera Liao, Michael Muller, Mayank Agarwal, Stephanie Houde, Kartik Talamadupula and Justin D. Weisz in industry for working with me and providing their help.
In addition, I thank Tongshuang (Sherry) Wu, Xi (Victoria) Lin, Yue Jiang, Diyi Yang, Deqing Fu, Yushi Hu and Wangchunshu Zhou for working with me in the past. Life would be no fun without friends. I would like to thank Nan (Nancy) Xu for supporting me unconditionally whenever I needed help. She is my collaborator, best friend and role model. I also thank my friends Jun Yan, Fei Wang, Pei Zhou, Qinyuan Ye, Zihao He, Defu Cao and many others for giving me a hand when I needed them. Last but not least, I love my family: my mother Meirong Liu, my father Xinsheng Sun and my sister Min Sun. They are the reason I stay strong and resilient in the face of all the challenges I have encountered. I will try my best to make them proud.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Thesis Statement
    1.2 Faithful and Trustworthy Evaluation Metrics
    1.3 Strategic and Efficient Data Acquisition Leads To Better Models
    1.4 Outline of the Thesis
Chapter 2: Faithful Evaluation of Large Language Models
    2.1 Evaluating Large Language Models on Controlled Generation Tasks
        2.1.1 Motivation and Contribution
        2.1.2 Numerical Planning
        2.1.3 Content-Controlled Generation
        2.1.4 Story Generation
        2.1.5 Rationale Generation
        2.1.6 Controlled Paraphrase Generation
        2.1.7 Conclusion
    2.2
        2.2.1 Motivation and Contribution
        2.2.2 Definition of Dialect Robustness
        2.2.3 Existing Metrics
        2.2.4 Testing Dialect Robustness
            2.2.4.1 Micro-level Dialect Features
            2.2.4.2 Sentence-level Dialect Rewrites
            2.2.4.3 Statistical Methods
        2.2.5 NANO
            2.2.5.1 Acceptability Pretraining
            2.2.5.2 Finetuning
        2.2.6 Experiments
        2.2.7 Sensitivity to Dialects
        2.2.8 Conclusion
Chapter 3: Correlation Between Model Performance and Data Quality
    3.1
        3.1.1 Motivation and Contribution
        3.1.2 Preliminaries
        3.1.3 Do crowdsourced rationales aid human interpretability?
        3.1.4 Preliminary Studies
            3.1.4.1 Direct Assessment
            3.1.4.2 Indirect Assessment
            3.1.4.3 Categorizing Crowdsourced Rationales
        3.1.5 Can Models Benefit from Crowdsourced Rationales?
    3.2 Event Gender Bias in Wikipedia and Gender Bias in LLM-Generated Recommendation Letters
        3.2.1
            3.2.1.1 Motivation and Contribution
            3.2.1.2 Experimental Setup
            3.2.1.3 Detecting Gender Biases in Events
            3.2.1.4 Results
            3.2.1.5 Conclusion
        3.2.2 Gender Bias in LLM-Generated Reference Letters
            3.2.2.1 Why Reference Letters?
            3.2.2.2 Bio-based Reference Letter Generation
Chapter 4: Efficient High-Quality Data Acquisition Leads to Better Models
    4.1
        4.1.1 Motivation and Contribution
        4.1.2 ExPUN Dataset
            4.1.2.1 Data Preparation
            4.1.2.2 Dataset Annotation
            4.1.2.3 Dataset Statistics and Quality Control
            4.1.2.4 Dataset Analysis
        4.1.3 Experiments
            4.1.3.1 Pun Explanation
            4.1.3.2 Keyword-Conditioned Pun Generation
        4.1.4 Conclusion
    4.2
        4.2.1 Motivation and Introduction
        4.2.2 DreamSync
        4.2.3 Datasets and Evaluation
            4.2.3.1 Training Data Acquisition
            4.2.3.2 Evaluation Benchmarks
        4.2.4 Experiments
            4.2.4.1 Experimental set-up
            4.2.4.2 Benchmark Results
            4.2.4.3 Analysis & Ablations
            4.2.4.4 Human Evaluation
        4.2.5 Conclusion
Chapter 5: Conclusions
Bibliography

List of Tables

2.1  Task illustration for the Numerical Planning Benchmark. We test LLMs' numerical planning ability under various constraints (word counting and end word) and granularities (word, syllable, sentence, and paragraph). Due to space limitations, we only show the full constraints under the word granularity here.
2.2  Success rates for the word count planning task. Surprisingly, few-shot in-context learning (ICL) underperforms zero-shot (zs) on numerical planning.
2.3  Results on content-constrained text generation.
2.4  Performance of different decoding strategies and LLMs for open-ended story generation. Vicuna stands for Vicuna-7B, Falcon for Falcon-7B-Instruct.
2.5  Rationales generated by ChatGPT are on par with the best crowdsourced rationales (ECQA) with FlanT5-XXL [chung2022scaling] as the backbone model. Ruling out leakage results in at least a 5% accuracy drop.
2.6  Performance comparison with ground-truth syntactic control for AESOP [141] and five-shot ChatGPT. With coarse syntactic control from a shallow height of pruning, AESOP, the state-of-the-art finetuned small model, outperforms five-shot ChatGPT across all semantic preservation (BLEU, ROUGE scores, and METEOR) and syntactic conformation metrics (TED-R and TED-E at the height of two) by a large margin. ↑ means higher is better, while ↓ means lower is better. By comparing ctrl with syntax explanation, we show that ChatGPT is better at mimicking the syntactic structure from an exemplar than utilizing the syntactic information directly from the syntax.
2.7  Examples of generated explanations for pruned constituency parse trees by ChatGPT.
2.8  Number of evaluation examples per language before and after semantic perturbation. The middle three rows are the number of examples to which each perturbation was applicable, and the final row Agg. is the number of examples to which at least one perturbation is applicable, which we use in our final analysis.
2.9  Success rates. Training with NANO starts to improve upon the strongest baseline BLEURT with mT5-XL and achieves the best performance with mT5-XXL. We boldface the success rates that are better than random chance (0.5) and significant after applying Bonferroni correction for multiple comparisons. Training with NANO improves dialect robustness for the XL- and base-scale models.
2.10 Coefficients from the regression model for Dialect vs. Semantic Perturbation, indicating dialect robustness, before and after using NANO. We boldface significant coefficients where NANO helps. We show that training with NANO improves dialect robustness across all model sizes and languages.
3.1  Example annotations from CoS-E v1.11 and ECQA for the question "What are you waiting alongside with when you're in a reception area?" with options 1: motel, 2: chair, 3: hospital, 4: people, 5: hotels, and the correct option people. The CoS-E annotation directly combines the question and the correct answer, while the ECQA annotation provides additional background knowledge.
3.2  Human study directly comparing ECQA and CoS-E rationales on 120 ComQA instances, for the presence of background knowledge and answer leakage.
3.3  Examples of crowdsourced rationales for CoS-E and ECQA, vs. our manually constructed rationales that declaratively combine the question and the answer without providing any background knowledge or commonsense reasoning.
3.4  Results from our human study via indirect assessment to compare 100 pairs of crowdsourced and constructed rationales. The IAA is 0.61.
3.5  Our manual four-way categorization of CoS-E v1.11 (dev.) rationales, with examples. Bolded options indicate ground truth. We find that 88.45% of rationales do not provide additional background knowledge.
3.6  ComQA accuracies under various train (row) and test (column) settings. r1 is an I→O T5 baseline without access to rationales during training; the following rows use different amounts (%RTr) of CoS-E rationales (r2-r6) and shuffled ECQA rationales (r7-r11) for training IR→O T5 models. At inference time, each model predicts the label given no rationale (c1), or given the crowdsourced rationales for the entire test set (c2-c4), or a subset of the CoS-E test set (c5-c8), selected based on the rationale categories in Table 3.5. c4 and c3 report ECQA test set performance, when the test rationales are shuffled or not, respectively. We report accuracies averaged across 3 random seeds (stdev as subscript) for %R selection during training.
3.7  The importance of shuffling the order of sentences in ECQA rationales in training. Without shuffling, the model relies on the spurious correlation due to sentence order, as compared to r7-11/c4 in Tab. 3.6. Accuracies are averaged across 3 random seeds (s.d. as subscript) for %R selection during training, as in Tab. 3.6.
3.8  QuaRTz model accuracy with and without training with knowledge statements as rationales. We report accuracies averaged across 3 random seeds (s.d. as subscript) for %R selection during training, as in Table 3.6.
3.9  The marriage events are under the Career section for the female on Wikipedia. However, the same marriage is in the Personal Life section for the male. A yellow background highlights events in the passage.
3.10 Statistics showing the number of celebrities with a Career section or a Personal Life section, together with all celebrities we collected. Not all celebrities have Career or Personal Life sections.
3.11 The performance of the off-the-shelf event extraction model on both the common event extraction dataset TB-Dense (TB-D) and our corpus with manual annotation. S represents the sampled data from the corpus. S-F and S-M represent the sampled data for female career descriptions and male career descriptions separately.
3.12 Caption
3.13 Caption
3.14 Qualitative evaluation results on ChatGPT for biases in Lexical Content. Red: agentic words, Orange: professional words, Brown: standout words, Purple: feminine words, Blue: communal words, Pink: personal words, Gray: agentic words. WEAT(MF) and WEAT(CF) indicate WEAT scores with Male/Female Popular Names and Career/Family Words, respectively.
4.1  Two examples of annotated Keywords (KWD) and Natural Language Explanations (NLEx) for puns in our dataset. The highlighted texts are annotated keywords that contribute to making the text funny.
4.2  Two examples with the annotation fields that we collect. We use underline to mark the commonsense knowledge that people need in order to understand the joke.
4.3  Overall stats for annotation fields in ExPUN.
4.4  Agreement stats for annotated fields in the ExPUN dataset. We report averaged Cohen's κ and Spearman's ρ for numeric ratings (AF1-AF4), and averaged BLEU-4 and METEOR for text fields (AF5-AF6).
4.5  Keyword annotations from different workers. wA shows aggregated keywords from our algorithm.
4.6  Pun explanations generated by the T5 model. We use underline to indicate the pun word in the input.
4.7  Automatic (Word Incorporation Rate) and human evaluation (Success %) of puns generated by models finetuned using automatically-extracted (RAKE) and human-annotated (ExPUN) keywords (with the AmbiPun baseline [96]). PT stands for Pre-Training and FT stands for Fine-Tuning. Both T5 PT+FT models finetuned with RAKE-based keywords or ExPUN-based keywords use RAKE-based keywords during pretraining.
4.8  Examples of input pun words and keywords and the resulting generated puns. We show examples of both homographic and heterographic generated puns.
4.9  Benchmark on Text Faithfulness and Visual Appeal. All models are sampled with the same set of four seeds, i.e., K = 4. Best scores under each backbone T2I model are highlighted in bold; gains and losses compared to base models are highlighted accordingly. DreamSync significantly improves SD-XL and SD v1.4 in alignment and visual appeal across all benchmarks. Additionally, DreamSync does not sacrifice image quality when improving faithfulness.
4.10 Ablation of different VLM rewards. Models are evaluated after one iteration.
4.11 Scores given by the human preference model ImageReward [159]; model scores are logits and can be negative. Models trained with DreamSync outperform other baselines (higher is better), without using any human annotation.

List of Figures

2.1  We test large language models on five controlled generation tasks with various control factors using automatic evaluation methods. We show a spectrum of abilities of large language models on such tasks and conclude that large language models struggle at fine-grained hard constraints such as numerical planning.
2.2  Histogram visualization of the distribution (frequency, z-axis) of input numbers (x-axis) and output numbers (y-axis) for word count planning. Left: querying ChatGPT to generate a continuation of a given prefix with N words. Right: querying ChatGPT to generate a continuation with N words of a given prefix that ends with a given word. Small red dots mark those bars where output numbers equal input numbers. These bars represent the fine-grained success rates. In either case, there is a significant drop when the input number reaches six.
2.3  An illustration of dialect robustness in the context of generation evaluation. We define dialect robustness as the expectation that evaluation metrics have the same output across dialects that share the same semantics. Dialect edits (highlighted in yellow) should not lead to a greater degradation of score than edits that change the underlying semantics (highlighted in underline). BLEURT-20 in the figure assigns a higher score to semantically perturbed sentences than to sentences with dialect features, exposing its vulnerability to dialects.
2.4  Coefficients from the regression model for Dialect vs. Semantic Perturbation (ϕ dialect vs. perturb) and MT vs. Semantic Perturbation (ϕ MT vs. perturb). The higher ϕ dialect vs. perturb is, the more dialect-robust a metric is. Error bars show 99% confidence intervals; they are larger for the English evaluations because there is less data. ϕ MT vs. perturb serves as a stress test to measure evaluation metrics' abilities to recognize semantic changes. We show that evaluation metrics are good at recognizing semantic changes but not dialect changes. For all metrics except BLEURT and NANO, ϕ dialect − ϕ perturb is negative for at least one language, exposing their vulnerability to dialects.
3.1  Illustration of our investigation into free-form rationales for commonsense QA from CoS-E [150] and ECQA [2]. We conduct human studies to understand the perceived usefulness of rationales, by asking if they contain the background knowledge necessary to answer a question (yellow highlights). We also investigate if rationales leak the answer to models that use them as additional training signals.
Our work compares rationales from different sources, and finds that ECQA rationales are preferable to CoS-E rationales on various axes. Finally, we find that crowdsourced rationales also offer greater benefits to both humans and models than generated rationales.
3.2  The percentile of extracted events among all detected events, sorted by their frequencies in descending order. The smaller the percentile is, the more frequently the event appears in the text. The extracted events are among the top 10% for the corresponding gender (e.g., extracted female events among all detected events for female writers) and within the top 40% for the opposite gender (e.g., extracted female events among all detected events for male writers). The figure shows that we are not picking rarely occurring events, and the result is significant.
4.1  Distributions of (a) number of tokens and (b) number of sentences in explanations (AF5), (c) tokens in keyword phrases (AF6), and (d) keyword phrases per sample. Horizontal lines are used to show the min, mean, and max values for each distribution.
4.2  The impact of using human-written (4.2a) and model-generated explanations (4.2b and 4.2c) vs. no explanations (constant dotted lines) on pun classification accuracy. All reported numbers are computed with a three-seed average. For each data point, we train a model on the full dataset, but only provide explanations for a given percentage, as shown on the x-axis.
4.3  DreamSync. Given a prompt, a text-to-image generation model generates multiple candidate images, which are evaluated by two VLM models: one VQA model that provides feedback on text faithfulness and the other on image aesthetics. The best images chosen by the VLMs are collected to fine-tune the T2I model. This process can be repeated until convergence on the feedback is achieved.
4.4  Qualitative examples of DreamSync improving image-text alignment after each iteration. LoRA fine-tuning on generated and filtered prompt-image pairs can steer the model to gradually capture more components of the text inputs.
4.5  PaLM-2 generated training prompts and their corresponding images generated via DreamSync. Prompt acquisition requires no human effort. It enables us to train on more complex and diversified prompt-image pairs than found in typical datasets.
4.6  DreamSync improves faithfulness and aesthetics iteratively. More examples pass the filters with additional iterations.
4.7  Human study with three raters on 1060 DSG prompts.

Abstract

Large Language Models (LLMs) are powerful and have revolutionized areas such as language understanding and generation. Given the broad impact that LLMs have been making, it is hard to systematically evaluate where the models make mistakes or underperform. This creates a pressing need for careful and nuanced model evaluation methods. Equipped with reliable evaluation metrics that we develop, we find that existing LLMs are not perfect and could be biased towards certain demographic groups.
Digging deeper into the development cycle, we find that the quality of both pretraining and finetuning data heavily impacts LLM performance (e.g., fairness and alignment). Therefore, it is important to ensure data quality during LLM development. To obtain good-quality data, this thesis covers two approaches: human annotation and careful synthetic data generation. For human annotation, we either define new tasks and work with crowd workers to ensure high annotator agreement and data quality, or conduct strategic data acquisition, including scraping high-quality content to get targeted data for model training. For synthetic data generation, we rely on feedback from additional AI models to select good-quality samples and improve the model quality iteratively. The overarching goal of this thesis is to advance responsible LLM development by building robust evaluation metrics and developing smart data acquisition techniques. Ultimately, this aims to ensure alignment with human values and needs in the evolving landscape of artificial intelligence.

Chapter 1
Introduction

1.1 Thesis Statement

Large-scale models have marked the beginning of a new era, significantly transforming language understanding, text and image generation, and complex decision-making tasks. At the same time, these models carry risks: for example, they may perpetuate stereotypes or produce misleading information. Due to the limitations of existing evaluation methods, however, these problems are often overlooked. This underlines the imperative need for more careful and nuanced model evaluation and assessment.

Upon investigation, large models may generate biased or inappropriate content due to inadequacies in their training data. For instance, Wikipedia, often used as a source for training neural models, contains articles where contributors list personal life events in the professional sections for women but not for men [61]. This could lead to gender biases in models trained on such data and negatively impact downstream tasks, such as using LLMs for writing recommendation letters [151]. My thesis addresses these challenges, identified through trustworthy evaluation methods, with a focus on aligning large models with human intentions from a data-centric perspective. I am dedicated to developing data-efficient training strategies to lessen the dependence on vast datasets and to innovating in evaluation methodologies. This dual emphasis on data optimization and enhanced evaluation is vital for the responsible progression of large-scale models, ensuring they are not only technologically advanced but also aligned with human values and needs.

Fair and robust evaluation methods not only guide models towards better alignment during development but also provide deeper insights into the capabilities of LLMs. Guided by such metrics, I place a strong emphasis on enhancing data efficiency and inclusiveness for all demographic groups throughout the generative model development process. This focus significantly improves the efficiency and fairness of large models, while also recalibrating the balance between data quantity and quality. My data acquisition efforts involve meticulously curating diverse datasets from various sources [171], repurposing existing large corpora to meet specific modeling objectives [62], and innovatively curating datasets using feedback from AI models [57].
Models developed with this high-quality data are then re-evaluated, with the results serving both as a testament to the data's effectiveness and as a tool to identify any issues in the new models. This iterative process of evaluation and data acquisition continues until we achieve our ultimate goal: a trustworthy LLM that is closely aligned with human intentions.

1.2 Faithful and Trustworthy Evaluation Metrics

Often, models meticulously crafted in controlled lab environments encounter unforeseen challenges when deployed in real-world settings. This discrepancy arises because the development environment, characterized by clean and high-quality inputs, may not accurately represent the complexities of real-world applications. For example, consider a language model trained primarily on formal, written text. While it may perform exceptionally well in generating academic-style articles, it could struggle significantly in understanding and responding to colloquial, region-specific slang or dialects encountered in everyday conversations. Consequently, there is a growing need for faithful and trustworthy evaluation metrics that end users can rely on to assess a model's quality. Such metrics not only enhance trust between humans and AI, as evidenced in our findings [59], but also provide essential tools for developers to gauge model performance. These metrics help in pre-empting and addressing potential safety and fairness concerns, thereby guiding responsible model development. My research has revealed several critical insights about LLMs: their limited efficacy in controlled generation tasks like simple numerical counting [64], their perceived greater utility in highly developed countries [58], and their propensity to perpetuate gender stereotypes in professional writing tasks, such as drafting recommendation letters [151]. These findings highlight the role of evaluation methods as crucial sentinels for LLMs. They bring to light significant safety concerns and emphasize the need for cautious use of LLMs by end users.

While substantial progress has been made in developing neural-based evaluation metrics capable of assessing various aspects of models, these metrics themselves often fall short in terms of robustness and fairness. For instance, the Regard Score, designed to evaluate gender bias in text generation models, demonstrates brittleness: simple paraphrasing within its calculation can lead to drastically different outcomes, revealing its inherent instability [1]. Similarly, prevalent evaluation metrics inadequately recognize dialects, displaying a bias towards what is commonly perceived as the standard dialect. This is evident in cases where American English consistently outscores Indian English, irrespective of the reference texts used [62]. Such biases not only disadvantage demographic groups speaking these dialects but also hinder their access to AI tools evaluated by these metrics. As a matter of fact, most existing evaluation metrics do not consider dialect diversity in their training, resulting in their failure to recognize various dialects. To counter this, we have introduced NANO, an innovative pretraining step that extracts dialectal information from a vast corpus (i.e., mT5) and explicitly teaches the model to discern dialectal differences during pretraining. This approach has shown promise in enhancing the dialectal robustness of evaluation metrics.
Moving forward, my focus will remain steadfast on developing and refining evaluation metrics that are both fair and robust, ensuring they can faithfully and accurately evaluate large-scale models. The effectiveness of NANO shows that the biases in evaluation metrics stem largely from their data acquisition processes. In the following section, we delve deeper into the critical role of effective data acquisition in the development of large models, and we elaborate on our specific approach to tackling this challenge.

1.3 Strategic and Efficient Data Acquisition Leads To Better Models

In the current landscape of large models, a prevailing belief is that more data equates to more intelligence, often leading to an oversight of data quality. This has sparked an implicit race to amass as much data as possible for training large language models. However, the extent of the impact that data quality has on model utility remains an under-investigated area. I advocate for the principle that only high-quality data can truly advance the progress of developed models, while poor data quality could be counterproductive. This belief has driven my recent investigations into the correlation between data quality and model performance. Starting with rationales, as extensive research has shown their effectiveness in improving the reasoning abilities of large language models [jasonwei], my study of two major human-annotated rationale datasets [63] revealed that over 97% of rationales in one dataset did not enhance human interpretability or model reasoning. In contrast, incorporating just 5% of high-quality rationales resulted in a 50% boost in model performance. This finding sounded an alarm, underscoring the significance of high data quality and motivating the exploration of methods to obtain it.

In LIMA [171], we examined how minimal data can be used for aligning pretrained language models with user intentions. Remarkably, with only 1,000 carefully curated prompts and responses, LIMA demonstrated robust capabilities that rival commercial closed-source models like Claude and GPT-4, excelling in tasks requiring complex response structures, such as organizing travel plans and hypothesizing about historical scenarios. Additionally, LIMA's ability to adapt to new tasks not included in its initial training was exceptional. Notably, this was achieved without reinforcement learning or human preference modeling, while such methods have been deemed crucial since the introduction of ChatGPT [10]. LIMA's success not only highlights the significance of data quality in the era of large language models but also suggests that supervised fine-tuning (SFT) could be an effective alternative to reinforcement learning from human feedback (RLHF) methods.

Emphasizing the principle that data quality is paramount for alignment, my research has branched into the dynamic and increasingly influential field of text-to-image generation. This area, while distinct, complements the advances made in language modeling and represents a significant frontier in AI development. In a significant shift from advanced methods reliant on human feedback, my project DreamSync [57] employs an innovative approach where the model itself learns from AI-generated feedback. This process involves automatically collecting high-quality samples and using them for self-improvement. DreamSync's objective is to enhance the congruence between text prompts and the resultant images while also improving their visual quality.
To accomplish this, we harness the capabilities of two vision-language models: a Visual Question Answering (VQA) model that assesses the fidelity of generated images to the textual prompts, and another model that evaluates their aesthetic appeal. The text-to-image generation model initially produces a set of images, from which the two vision-language models select the best ones based on predetermined criteria for faithfulness and aesthetics. These high-quality images are then utilized to fine-tune the text-to-image generation model, in a process repeated over multiple iterations. The success of DreamSync not only underscores the critical role of data quality in text-to-image generation models but also demonstrates how superior image quality can effectively guide the generation process towards better alignment and aesthetics, all without relying on reinforcement learning (RL) or human annotation. This makes DreamSync a versatile and cost-effective tool for enhancing any text-to-image model, irrespective of its underlying architecture.

LIMA and DreamSync highlight my commitment to emphasizing data quality, harvesting superior data, and developing models grounded in this data across the two pivotal domains of generative AI: large language models and text-to-image generation. In addition to these innovative approaches, I have also extensively engaged in traditional high-quality data acquisition methods through human annotation. This has been especially crucial in techniques like RLHF, particularly in novel contexts. My collaboration with annotators across multiple rounds of training and rigorous data collection has laid the groundwork for a variety of new tasks.

1.4 Outline of the Thesis

In this thesis, we first discuss the importance of developing automatic and robust evaluation metrics to gauge the performance of language models in Chapter 2, where we introduce two publications. Through evaluating large language models on controlled generation tasks (Section 2.1), we discover the deficiencies of large language models in meeting fine-grained hard constraints using automatic evaluation metrics. However, these evaluation metrics are not always dependable. For example, they may not account for dialect differences, which we verify and improve in Section 2.2. We then propose to improve the utility of large language models from the data perspective. Two pieces of work in Chapter 3 suggest a strong correlation between data quality and model utility and provide a strong foundation for our data-centric approach. To acquire high-quality data for generation tasks (Chapter 4), we carefully design and quality-control our data collection process, as in ExPUNations (Section 4.1), or use image understanding feedback as the signal to iteratively filter synthetic data and build better text-to-image generation models (Section 4.2).

Chapter 2
Faithful Evaluation of Large Language Models

2.1 Evaluating Large Language Models on Controlled Generation Tasks

Please refer to the published work [64].

2.1.1 Motivation and Contribution

Text generation models should generate texts that meet controllable constraints as humans wish [167]. For example, one can avoid the blandness caused by repetitive patterns by controlling the syntax of generated sentences [53, 113, 141]. In a customized dialogue system, one should be able to control the persona of the utterance [135]. Previous works either finetune generation models such as BART [81] on specific tasks for better controllability or design constrained decoding strategies (e.g., the look-back decoding strategy by [160]) for controlled generation.
Large Language Models (LLMs) have recently shown great potential in various generation tasks. For example, [56] shows that ChatGPT with GPT-4 as an engine achieves commercial-level machine translation quality. [76] find that annotators prefer summaries generated by ChatGPT over state-of-the-art summarization models. However, few works investigate the controllability of large language models. Towards this end, we aim to study and understand the controllability of large language models to answer the question: Are large language models better than finetuned smaller models at controllability on generation tasks?

Task                            Control                                Benchmark                        Evaluation
numerical planning              prefix & number of words & end word    NPB                              MSE, success rate
constrained content generation  sentiment, topic, keyword              Amazon Review, CommonGen, M2D2   off-the-shelf model, ppl
story generation                prefix                                 ROC, Writing Prompts             repetition, diversity, coherence
rationale generation            correct answer                         CoS-E, ECQA                      increased accuracy
paraphrase generation           semantic & syntax                      ParaNMT, QQPPos                  lexical overlapping, syntax match

Figure 2.1: We test large language models on five controlled generation tasks with various control factors using automatic evaluation methods. We show a spectrum of abilities of large language models on such tasks and conclude that large language models struggle at fine-grained hard constraints such as numerical planning.

The main contribution of this work is a comprehensive analysis of LLMs' controllability on five tasks and ten generation benchmarks, including controlled story generation, controlled free-form generation with sentiment and topics, controlled paraphrase generation, and controlled rationale generation, as shown in Figure 2.1. We further design a new simple yet challenging benchmark named the Numerical Planning Benchmark (NPB), where the task is to satisfy numerical constraints at four granularities (word-, syllable-, sentence- and paragraph-level) and under different content controls (e.g., prefix and ending). For evaluation, we use automatic metrics, which are imperfect yet convenient and reproducible. After an in-depth examination, we categorize LLMs' controllability on a spectrum: from lagging behind, to being on par with, to surpassing smaller finetuned models. Our findings indicate that large language models have difficulties adhering to specific hard constraints, such as numerical planning.

2.1.2 Numerical Planning

Can LLMs count from two to ten?

Granularity     Task Illustration
Word/Syllable   Generate a sentence using exactly 5 words/syllables.
                Complete sentence "This is a story" using exactly 5 words/syllables.
                Complete sentence "This is a story" using exactly 5 words/syllables, including the last word as "town".
Sentence        Generate a paragraph with 5 sentences, ...
Paragraph       Generate an article with 5 paragraphs, ...

Table 2.1: Task illustration for the Numerical Planning Benchmark. We test LLMs' numerical planning ability under various constraints (word counting and end word) and granularities (word, syllable, sentence, and paragraph). Due to space limitations, we only show the full constraints under the word granularity here.

Task Description. We introduce the Numerical Planning Benchmark (NPB) as an intuitive task that tests the basic numerical planning ability of LLMs. The high-level task descriptions can be found in Table 2.1. We are inspired by real-world scenarios such as creative writing. For example, writers may wish to generate sentences or poems with a specific structure, such as a fixed number of words or syllables in each line, aiming to adhere to particular forms (e.g., sonnets, where each line contains exactly 10 or 11 syllables [147]). Meanwhile, humans may also want full control over the start and end of each line for rhetorical purposes such as alliteration and rhyming. Inductively, we formulate our numerical planning benchmark at four different granularities: generating a piece of text that contains a predefined number of words, syllables, sentences, or paragraphs given a plausible pair of prefix (start) and suffix (ending) as constraints. The prefix is given to LLMs such that they are only queried to generate the continuations.
Evaluation Metrics. We use success rate (SR) and mean squared error (MSE) as automatic evaluation metrics. As our control is two-fold, we separately calculate the success rates of 1) generating the continuation with the correct counts and 2) generating the continuation with the proper ending. We also calculate the MSE between our input numbers and output numbers.

Figure 2.2: Histogram visualization of the distribution (frequency, z-axis) of input numbers (x-axis) and output numbers (y-axis) for word count planning. Left: querying ChatGPT to generate a continuation of a given prefix with N words. Right: querying ChatGPT to generate a continuation with N words of a given prefix that ends with a given word. Small red dots mark those bars where output numbers equal input numbers. These bars represent the fine-grained success rates. In either case, there is a significant drop when the input number reaches six.

Evaluate with LLMs. We evaluate ChatGPT and Alpaca-7b on our NPB benchmark in zero-shot and few-shot settings. Each request used to query the LLMs corresponds to a real case in the Romance Books and Reddit Short Stories datasets (huggingface.co/datasets/AlekseyKorshuk/romance-books, www.kaggle.com/datasets/trevordu/reddit-short-stories). For word-level planning tasks (word and syllable count), we randomly select sentences from the above datasets. Then, we select the last word in each sentence as the suffix. Depending on how many additional words we query the LLMs to generate, we select the first few words in each sentence as the prefix (if we simply ask LLMs to generate freely without a prefix, the outputs lack diversity). Our prompt is written as Complete a sentence that starts with {prefix} using exactly {N} additional words (including the last word {last word}). The sentence must end with the word {last word}. Sentence: {prefix}, and LLMs will continue. In the few-shot setting, we provide the task description and three examples. For each example, we also provide explanations to help LLMs better understand our task. For example:

##Prefix: This is a story about a young girl's
##Last word: town
##N: 5
##Output: This is a story about a young girl's redemption in a small town.
##Explanation: We generated "redemption in a small town". It contains exactly 5 words and ends with the last word "town".

We query the LLMs to generate outputs from N = 2 to N = 10 words. Each number N has 100 evaluation samples. For paragraph-level tasks, the prefix and suffix are the first and last sentences in the corresponding paragraphs. For all experiments, our decoding strategy is top-p (p = 0.95) sampling with temperature T = 0.3 unless otherwise specified.

Model               SR - count   SR - last word   SR - both   MSE - count
GPT-2 (fine-tuned)  0.64         0.86             0.60        1.62
Alpaca-7b (zs)      0.17         0.31             0.09        9.19
Alpaca-7b (ICL)     0.14         0.34             0.07        9.76
Vicuna (zs)         0.08         0.09             0.03        27.68
Vicuna (ICL)        0.13         0.30             0.04        13.43
Falcon (zs)         0.13         0.42             0.08        11.60
Falcon-7b (ICL)     0.11         0.34             0.03        13.72
ChatGPT             0.41         0.74             0.36        3.64
ChatGPT (ICL)       0.37         0.78             0.34        4.95

Table 2.2: Success rates for the word count planning task. Surprisingly, few-shot in-context learning (ICL) underperforms zero-shot (zs) on numerical planning.
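To make the scoring above concrete, the snippet below is a minimal sketch of how the count success rate, last-word success rate, and MSE could be computed for this task. The whitespace-based word counting, the simple prefix stripping, and the `examples` structure are illustrative assumptions, not the exact implementation behind Table 2.2.

```python
import re

def avg(values):
    return sum(values) / len(values)

def evaluate_word_count_planning(examples):
    """Score generated continuations for the NPB word-count task.

    Each example is a dict with:
      n_target   - requested number of additional words
      last_word  - required final word
      prefix     - the given sentence prefix
      generation - the model's full sentence (assumed to repeat the prefix verbatim)
    """
    count_hits, end_hits, both_hits, squared_errors = [], [], [], []
    for ex in examples:
        # Strip the prefix to recover only the generated continuation.
        continuation = ex["generation"][len(ex["prefix"]):].strip()
        words = re.findall(r"[\w'-]+", continuation)
        n_generated = len(words)

        count_ok = n_generated == ex["n_target"]
        end_ok = bool(words) and words[-1].lower() == ex["last_word"].lower()

        count_hits.append(count_ok)
        end_hits.append(end_ok)
        both_hits.append(count_ok and end_ok)
        squared_errors.append((n_generated - ex["n_target"]) ** 2)

    return {
        "SR-count": avg(count_hits),
        "SR-last-word": avg(end_hits),
        "SR-both": avg(both_hits),
        "MSE-count": avg(squared_errors),
    }

# Hypothetical usage with the few-shot example above:
example = {
    "n_target": 5,
    "last_word": "town",
    "prefix": "This is a story about a young girl's",
    "generation": "This is a story about a young girl's redemption in a small town.",
}
print(evaluate_word_count_planning([example]))
# {'SR-count': 1.0, 'SR-last-word': 1.0, 'SR-both': 1.0, 'MSE-count': 0.0}
```

The success rates in Table 2.2 are simply such per-example indicators averaged over the 100 samples collected for each N.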
Result. We report the model performance of LLMs and a fine-tuned GPT-2-large model on the task of word count planning in Table 2.2. First, it is clear that LLMs are poor at numerical planning, although it is an extremely simple task for humans. Given its extremely poor performance, we consider Alpaca incapable of numerical planning. Secondly, LLMs learn to incorporate literal constraints, such as the last word, via few-shot in-context learning. Interestingly, few-shot in-context learning deteriorates the performance on numerical planning. Upon further inspection, we find that LLMs try to mimic the style or features (such as length) of the in-context examples and are, therefore, more likely to generate outputs with the wrong word counts once the input number N cannot be found in the examples. Our results resonate with [163, 73, 134] in that LMs do not truly understand task definitions via in-context learning.

Figure 2.2 is a fine-grained visualization of the input and output number distributions for zero-shot ChatGPT. Specifically, we compare LLMs' numerical planning abilities with an additional suffix constraint (e.g., complete the sentence with "redemption in a small town" using exactly 5 words, including the last word as "happy") and without one (e.g., complete the sentence with "redemption in a small town" using exactly 5 words). LLMs can generate more freely without suffix constraints to meet the numerical constraint. However, this does not always translate into a higher success rate for ChatGPT. We find that only when N is small (i.e., 2 and 3) does ChatGPT achieve a higher success rate if explicitly told the last word of the target sentence.

Finally, we would like to point out a few behaviors. First, although the general trend is that LLMs' numerical planning ability drops as N increases, N = 3 is a clear exception (it performs worse) across the various experiments we repeated. Second, by checking the failure cases, we find that ChatGPT always generates shorter continuations than required. Moreover, we see a sudden drop in model performance (from above ∼0.6 to ∼0.4) when the input number N increases from 5 to 6. We encourage future research to investigate these behaviors.

2.1.3 Content-Controlled Generation

Task Description. We consider three types of content constraints: topic, sentiment, and keyword.

Topic constraint. This requires the model to generate texts about certain topics. Traditional methods for topic-constrained generation either append a special token for different topics [12] or use trained topic classifiers [114] to guide the generation process.

Sentiment constraint. Similar to the topic constraint, this task requires the model to generate texts with certain sentiments. The aforementioned methods for topic-constrained generation also apply to sentiment-constrained generation.

Keyword constraint. Keyword-constrained, or lexically constrained, text generation requires the model to generate texts that contain certain keywords or tokens. Traditional methods for keyword-constrained text generation generally enforce lexical constraints on the outputs by modifying the search space according to the constraints.
Evaluation Metrics. We use the success rate as the evaluation metric to measure how well LLMs can follow the content constraints. Specifically, we use GPT-3.5 [101] based topic/sentiment classifiers with in-context learning using five examples per category to evaluate whether the generated texts belong to the specified topic or sentiment class. We consider an LLM to succeed on an example if the predicted class of the generated text is identical to the input constraint. For keyword-constrained generation, we use the keyword coverage metric, which measures the percentage of input keywords included in the generated texts.

Evaluate with LLMs. For content-constrained generation with LLMs, we follow [173] and use natural language instructions to prompt LLMs. Specifically, we use a prompt template of "Write a sentence about {topic name}" for topic-constrained generation, "Write an Amazon review with {level number} star about a random thing. The number of stars ranges from one to five. One star is the most negative, and five stars are the most positive" for sentiment constraints, and "Write a sentence using the following keywords: {keywords}" for keyword constraints. In addition to zero-shot evaluation, we also evaluate LLMs in the in-context learning setting by appending the following demonstration template: "Below are some examples for the task: Input: {input 1}, Output: {output 1}; Input: {input 2}, Output: {output 2} ...". We use 5 in-context examples per class following the practice in [173]. We compare various LLMs including ChatGPT, LLaMA, Alpaca, Vicuna, and Falcon in our experiments. We also report the results of Diffusion-LM [84] based on BERT-large [28] and task-specific classifiers as a competitive non-LLM baseline.

Model                     Topic   Sentiment   Keyword
Diffusion-LM              68.9    83.7        93.2
GPT-2 (1.5B, fine-tuned)  63.4    76.5        88.9
T5 (3B, fine-tuned)       67.3    83.9        94.8
LLaMA-7B (zs)             45.3    58.4        83.5
LLaMA-7B (ICL)            63.5    85.1        93.0
Alpaca-7B (zs)            58.9    78.4        91.2
Alpaca-7B (ICL)           65.2    86.9        94.8
Vicuna-7B (zs)            61.0    80.5        91.6
Vicuna-7B (ICL)           65.8    87.4        94.3
Falcon-7B (zs)            61.9    81.0        92.1
Falcon-7B (ICL)           66.0    87.7        94.2
ChatGPT (zs)              66.4    84.5        97.3
ChatGPT (ICL)             88.4    90.3        98.1

Table 2.3: Results on content-constrained text generation.

Results. The results are shown in Table 2.3. We find that Alpaca significantly outperforms LLaMA in the zero-shot setting. This is intuitive since the natural language instruction of constraints resembles instruction tuning data. However, this performance gap is significantly reduced when in-context learning is used. We think this is because the role of instruction tuning is mainly to adapt an LLM to human-friendly prompt formats instead of increasing the LLM's capability. We also find that ChatGPT achieves competitive performance without in-context learning and outperforms Diffusion-LM, a competitive supervised baseline, by a large margin. Moreover, the performance of ChatGPT can be further improved by adding in-context examples to the prompt. This suggests that LLMs' ability to follow content constraints expressed in natural language depends on three confounding factors: instruction tuning or supervised fine-tuning, in-context learning, and model capacity.
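For the keyword constraint, the coverage metric described under Evaluation Metrics above reduces to a simple counting procedure. The sketch below is one illustrative way to compute it, assuming single-word keywords and case-insensitive whole-word matching; the actual matching rule used in the experiments may differ.

```python
import re

def keyword_coverage(keyword_lists, generations):
    """Average fraction of input keywords that appear in the corresponding generation.

    keyword_lists - one list of (single-word) keywords per prompt; an assumption
    generations   - generated texts, aligned with keyword_lists
    """
    scores = []
    for keywords, text in zip(keyword_lists, generations):
        tokens = set(re.findall(r"[\w'-]+", text.lower()))
        covered = sum(1 for kw in keywords if kw.lower() in tokens)
        scores.append(covered / len(keywords))
    return sum(scores) / len(scores)

# Hypothetical usage:
print(keyword_coverage(
    [["dog", "frisbee", "park"]],
    ["A dog catches a frisbee in the park."],
))  # 1.0
```

The topic and sentiment success rates are computed analogously, except that the per-example check is whether the GPT-3.5-based classifier assigns the generated text to the class named in the constraint.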
2.1.4 Story Generation

Task Description. Given the beginning text of a story, open-ended story generation aims to decode texts that are coherent with previous topics and informative without undesired repetitions [139, 140, 160]. Despite their impressive success in generating fluent and accurate sentences for low-entropy tasks such as summarization or translation, large-scale language models (LLMs) still suffer from serious degeneration problems, such as undesired repetitions [47, 139] and unnatural topic drifts [83], under open-ended settings.

Datasets. We evaluate different generation methods on two popular benchmark story datasets: ROCStories and Writing Prompts. ROCStories (ROC) [97] is a corpus comprising commonsense stories written by crowd-sourced workers within 5 short sentences. Given the first sentence as a prefix, generation methods are required to produce four continuing sentences. Writing Prompts (WP) is a challenging task for inspiring continuations with abstract, high-level story prompts submitted by online users and continuations by others on Reddit [31]. Following prior literature [160], we utilize the first 32 tokens as the prefix and ask for a continuation of 256 tokens. Since we prompt different language models or decoding algorithms without extra fine-tuning, we directly sample 1,000 development and 1,000 testing instances from both ROC and WP.

Baselines. We evaluate the pre-trained LLM, GPT-2-XL [116], with search (SimCTG [139] and Look-back [160]) and sampling decoding methods (Nucleus sampling [47], Typical decoding [93] and η-sampling [45]).

Evaluation Metrics. Following the open-ended story generation literature [139, 83, 160], we adopt the following automatic metrics to evaluate generation quality: 1) rep-n to measure sequence-level repetition according to the portion of duplicate n-grams [153]; 2) diversity to assess the overall model repetition by considering rep-n at different n-gram levels; 3) coherence, measured as the cosine similarity between prefix and continuation embeddings represented by SimCSE [38]. We do not report the MAUVE [107] score due to the concern that MAUVE may not accurately reflect human preferences, considering the contradictory results between MAUVE and human evaluations observed in prior work [140].

Evaluate with LLMs. Chatbots that fine-tune LLMs on instructions are also evaluated: Vicuna-7B [20], Falcon-7B-Instruct [4] and ChatGPT. We prepend the following instruction before the story prefix as the prompt: 1) ROC: "Please continue writing this story within 4 very short sentences: <prefix>", 2) WP: "Please continue writing this story within 256 words: <prefix>".

ROC
LM        Method       rep-2↓   rep-3↓   rep-4↓   diversity↑   coherence↑
          Human        1.74     0.32     0.04     0.97         0.48
GPT-2-XL  Nucleus      1.80     0.35     0.12     0.97         0.33
          Typical      2.06     0.4      0.16     0.97         0.33
          η-sampling   0        0        0        1.00         0.34
          SimCTG       3.10     0.46     0.23     0.96         0.32
          Look-back    7.24     0.92     0.14     0.92         0.47
LLM       Vicuna       2.36     0.45     0.15     0.97         0.60
          Falcon       2.52     1.87     1.86     0.94         0.69
          ChatGPT      1.18     0.10     0.02     0.98         0.52

Writing Prompts
LM        Method       rep-2↓   rep-3↓   rep-4↓   diversity↑   coherence↑
          Human        15.61    3.78     1.24     0.80         0.31
GPT-2-XL  Nucleus      5.40     2.41     1.72     0.91         0.34
          Typical      3.60     1.51     1.10     0.94         0.30
          η-sampling   6.17     2.88     2.16     0.89         0.35
          SimCTG       2.84     0.36     0.19     0.97         0.31
          Look-back    7.94     1.25     0.33     0.91         0.52
LLM       Vicuna       8.27     2.59     1.14     0.88         0.49
          Falcon       11.20    7.79     6.94     0.76         0.53
          ChatGPT      5.99     1.15     0.35     0.92         0.52

Table 2.4: Performance of different decoding strategies and LLMs for open-ended story generation. Vicuna stands for Vicuna-7B, Falcon for Falcon-7B-Instruct.
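The repetition-based metrics in Table 2.4 can be summarized in a few lines of code. The sketch below follows the common definitions of rep-n (share of duplicate n-grams) and diversity (product of 1 − rep-n over several n-gram orders); the whitespace tokenization and the exact n-gram orders are assumptions rather than a description of the precise evaluation scripts used here.

```python
def rep_n(text, n):
    """rep-n: percentage of duplicate n-grams, i.e. 100 * (1 - unique n-grams / total n-grams)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

def diversity(text, orders=(2, 3, 4)):
    """diversity: product of (1 - rep-n / 100) over several n-gram orders."""
    score = 1.0
    for n in orders:
        score *= 1.0 - rep_n(text, n) / 100.0
    return score

# Hypothetical usage on a toy continuation:
continuation = "the dog ran and ran and ran into the park"
print(round(rep_n(continuation, 2), 2), round(diversity(continuation), 2))
# Coherence would be computed separately, e.g. as the cosine similarity between
# sentence embeddings (such as SimCSE) of the prefix and of the continuation.
```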
As shown in Table 2.4, both Vicuna-7B and ChatGPT are able to continue writing more fluent and coherent stories on both ROC and WP compared with other decoding methods based on GPT2-XL. 16 I→O 0.87 I+RCoS-E →O 0.92 I+RECQA →O 0.99 Model Leakage Non-Leakage I+RAlpaca-7B →O 0.91 0.86 I+RLLaMA-7B →O 0.87 0.79 I+RVicuna-7B →O 0.95 0.74 I+RFalcon-7B →O 0.83 0.65 I+RChatGPT →O 0.98 0.93 Table 2.5: Rationales generated by ChatGPT are on par with best-crowdsourced rationales ECQA with FlanT5-XXL [chung2022scaling] as the backbone model. Ruling out leakage results in at least 5% accuracy drop. Falcon-7B-Instruct obtains consistently lower diversity than other baselines, while ChatGPT achieves more robust performance in terms of diversity and coherence on both datasets. 2.1.5 Rationale Generation Task Description. Free-form rationales are known to aid model interpretability by providing additional world knowledge or commonsense reasoning steps [70, 85, 5]. [152] show that rationales can improve large language models’ ability to solve complex reasoning tasks. Extractive rationales in question-answering tasks are based on the input passage to extract related information to answer the question. Conversely, free-form rationales in the question-answering tasks are open-ended and condition on purely the question and options. [142] studies how different the quality of rationales would impact rationales’ utilities in terms of improving the model performance and claims that crowdsourced rationales are superior to generated rationales. [142] finetunes T5-base for both rationale generation and question answering. With the power of LLMs, we want to revisit the problem and see whether the utility of generated rationales conditioned on the question and options has been improved. Evaluation. We follow previous works and use the performance gap before and after adding rationales in the input to measure the utility of rationales, written as acc(I+R→O) - acc(I→O), where I stands for 17 question and options as input, R stands for rationales, and O stands for one of the options as output. For the backbone model for question answering, we use flanT5-XXL [24] instead of T5-base as it can handle longer sequences and is better at reasoning. [142] shows that two factors are mainly affecting the utility of rationales. One is leakage, which means that the correct answer is explicitly written in the rationales, and one can choose the correct answer among all the options by rationales without knowing the questions. The other is background knowledge, which is the additional background knowledge or reasoning step that can help answer the question. Datasets. CoS-E [cose] and ECQA [2] are the most popular free-form rationale datasets through crowdsourcing. ECQA builds on CoS-E and improves the quality of the CoS-E dataset from various aspects, including completeness, comprehensiveness, coherence, etc. They share the same sets of questions and options. Based on the findings from [142], both CoS-E and ECQA tend to leak the correct answer in the rationale, while ECQA rationales contain the background necessary to answer the questions. We conduct our analysis on question-answer pairs from the test set. Based on the evaluation acc(I+R→O) - acc(I→O), since we are evaluating on the same set of question-answer pairs, acc(I→O) is always the same. Therefore, we only compare acc(I+R→O) with different LLMs. Evaluate with LLMs. 
We prompt LLMs to provide background knowledge that can help answer the question and control whether to leak the correct options in rationales. We use ChatGPT as the example for illustration: • Leakage. We have ChatGPT take the role of A teacher who is trying to explain to students the rationale behind choosing the correct option for a multiple-choice question. Then prompt it with Question: {question} Options: {concatenated options} Explain the rationale behind choosing the correct option “{correct answer}”. 18 • Non-leakage. The role of ChatGPT becomes A teacher who is trying to explain to students the rationale behind a multiple-choice question. However, you do not want to leak the correct answer directly. and prompt it with Question: {question} Options: {concatenated options} Explain the rationale behind choosing the correct answer. Do not mention the correct answer “{correct answer}” explicitly. We highlight the difference between the two modes with underline. When prompting LLaMA and Alpaca, we remove the role description and only use the prompts. Through analysis, we aim to answer two questions: 1) Are LLM-generated rationales on par with crowdsourced rationales? 2) How much would leakage impact the utility of rationales? Result. Compared to T5, FlanT5 has better reasoning abilities [24] and is more capable of understanding instructions. Therefore, we use FlanT5 instead of using T5 as the backbone model for question answering, which can theoretically examine the utility of rationales better ruling out the incapability of models. Simply given the question and the option strings, Table 2.5 shows that FlanT5-XXL has an accuracy of 0.87 (while T5 in [142] scores 0.57 under the same setting). We then show the performance with crowdsourced rationales from both ECQA and CoS-E. With crowdsourced rationales from ECQA, the model almost solved the task and reached a performance of 0.99. With CoS-E rationales, the accuracy is 0.92. Our finding echoes with [142] that ECQA rationales are better quality. We then evaluate the utility of LLM-generated rationales under both the Leakage and Non-leakage scenarios. As the majority of crowdsourced rationales contain leakage [142], we consider it fair to compare LLM-generated rationales under the Leakage scenarios against crowdsourced rationales. We have two major findings: • ChatGPT generated rationales are on par with ECQA rationales from crowdsourcing. • We quantify the influence of leakage in measuring the utility of rationales: whether or not having leakage in rationales could result in an accuracy difference of at least 5%. 19 BLEU↑ METEOR↑ ROUGE-1↑ ROUGE-2↑ ROUGE-L↑ TED-R↓ (H=2) TED-E↓ (H=2) ParaNMT -Small Direct 10.8 26.2 44.2 18.6 44.9 1.4 1.5 Ctrl 14.3 30.7 51.4 25.8 50.7 1.3 1.2 Syntax exp. 13.6 27.3 46.4 20.2 47.0 1.4 1.4 AESOP 22.9 32.7 54.4 29.8 56.4 0.9 0.5 QQPPos Direct 6.7 25.2 39.8 15.6 41.5 1.8 1.8 Ctrl 10.5 25.6 43.0 19.8 45.2 1.4 1.4 Syntax exp. 9.0 26.5 42.8 17.8 14.2 1.8 1.8 AESOP 47.3 49.7 73.3 54.1 75.6 0.4 0.3 Table 2.6: Performance comparison with ground-truth syntactic control for AESOP [141] and fine-shot ChatGPT. With coarse syntactic control from a shallow height of pruning, AESOP, the state of the finetuned small model, outperforms five-shot ChatGPT across all semantic preservation (BLUE, ROUGE Scores, and METEOR) and syntactic conformation metrics (TED-R and TED-E at the height of two) by a large margin. ↑ means higher is better, while ↓ means lower is better. 
By comparing ctrl with syntax explanation, we show that ChatGPT is better at mimicking the syntactic structure from an exemplar than utilizing the syntactic information directly from the syntax. 2.1.6 Controlled Paraphrase Generation Task Description. Syntactically-controlled paraphrase generation can benefit a wide range of NLP applications such as dialogue generation [37], improving the robustness of models [52] or metrics [1], and diversifying other generation tasks such as diverse question generation. Syntactically-controlled paraphrase generation is challenging because it requires satisfying two folds of control signals: semantic preservation and syntactic conformation. By definition of paraphrases, the generation should have exactly the same semantics as the input text. With syntax as part of the input, generated paraphrases should also conform with the indicated syntax. The input syntax can come from a variety of sources. Datasets. We evaluate on ParaNMT-small [18], derived from ParaNMT [158], and QQP-Pos [72]. Our train/dev/test split follows previous works [72, 141]. Each instance is a tuple of {source sentence, exemplar, ground-truth paraphrase}, where the exemplar shares the same syntax with the ground-truth paraphrase. 20 Evaluation Metrics. We use two sets of evaluation metrics to evaluate the quality of generated paraphrases. We use lexical-overlapping-based scores to evaluate the semantic preservation and tree-edit distances to evaluate the syntactic conformation. For lexical-overlapping-based scores, the higher is better. For tree edit distance, the lower is better, indicating that the newly derived syntax matches more closely with the expected syntax. In this work, we prune the constituency parse trees at a level of 2 and only compare the high-level syntactic structure. TED-R means the tree edit distance between the candidategenerated sentence with the ground-truth paraphrase as the reference. TED-E compares the candidate sentence against the exemplar that only provides the syntax. Evaluate with LLMs. We provide three ways to prompt for the controlled paraphrase generation: • Direct. We prompt LLMs directly without specifying any constraints. The prompt is written as Paraphrase {source sentence}. Please only have the paraphrase in the response. • Control. Under this mode, we use the exemplar sentence for the syntactic control signal. The prompt is written as Paraphrase “{source sentence}” so that it uses the syntactic structure from “{exemplar}”; please only have the paraphrase in the response. We observe that under the Control mode, the generated paraphrases would sometimes take the syntactic information from the exemplars and the semantic meaning from exemplar sentences. To solve this, we introduce the third mode Control with syntax explanation. We first extract the constituency parse structure from the exemplar sentence using Stanford CoreNLP, prune the parse tree at the height of two (i.e., parse at H2), and then ask ChatGPT to generate a natural language explanation of the pruned syntactic parse, which we refer to as syntax explanation. The generated syntax explanation will be part of the input. • Control with Syntax Explanation. The prompt is written as Paraphrase “{source sentence}" so that the sentence has a syntactic structure of “{pruned syntax}". {generated explanation for the syntax.} Please only have the generated paraphrase, not its parse, in the response. 
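To illustrate how the pruned syntactic control signal can be obtained, the sketch below prunes a bracketed constituency parse (e.g., produced by Stanford CoreNLP) at a small height so that only the top-level skeleton such as "(ROOT (S (NP ) (VP )))" remains. It assumes nltk is available; the exact height convention may differ slightly from the one used in our experiments.

from nltk import Tree

def prune_parse(parse_str: str, height: int = 2) -> str:
    # Keep constituent labels down to `height` levels below ROOT; drop the rest.
    def truncate(node, depth):
        if not isinstance(node, Tree):
            return None                       # drop terminal tokens
        if depth == height:
            return Tree(node.label(), [])     # keep the label, empty out children
        children = [truncate(child, depth + 1) for child in node]
        return Tree(node.label(), [c for c in children if c is not None])
    pruned = truncate(Tree.fromstring(parse_str), 0)
    return pruned.pformat(margin=10**6)       # single-line bracketed string

# prune_parse("(ROOT (S (NP (PRP He)) (VP (VBD ran)) (. .)))")
# -> "(ROOT (S (NP ) (VP ) (. )))"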
21 Pruned Parse at H=2 Explanation (ROOT (S (NP ) (VP ))) This represents a sentence structure with a noun phrase and a verb phrase as its constituents. (ROOT (FRAG (SBAR ) (. ))) This is a sentence with a fragment that includes a subordinate clause followed by a period. (ROOT (SBARQ (WHADVP ) (SQ ) (. ))) This sentence structure represents an interrogative sentence with a subord -inate clause before the main clause. (ROOT (SQ (VBP ) (RB ) (NP ) (VP ) (. ))) This is a parse tree for a sentence containing a main verb and its subject, with a possible adverb and complement structure. Table 2.7: Examples of generated explanations for pruned constituency parse trees by ChatGPT. Table 2.7 shows examples of generated explanations for constituency parse trees pruned at height two by ChatGPT. We prompt ChatGPT from zero shots to five shots for our experiments, find that ChatGPT’s performance peaks with five shots as expected, and compare the performance of five-shot ChatGPT with AESOP [141]. The backbone of AESOP is the BART-base model, a 140m-parameter model finetuned with specialized input and output format tailored for the controlled paraphrase generation task. To the best of our knowledge, AESOP remains the state-of-the-art paraphrase generation model on both ParaNMT-small and QQPPos datasets. Result. Table 2.6 shows the performance comparison between five-shot ChatGPT and AESOP. We show that AESOP surpasses ChatGPT across all evaluation metrics for both semantic preservation metrics (lexicaloverlapping based metrics including BLEU, ROUGE scores, and METEOR) and syntactic conformation metrics (TED-R and TED-E at the height of two). In addition, we find that ChatGPT’s performance is the best under the setting of Control, where we use exemplar sentences for control signals. Compared with the setting Control with syntax explanation, Table 2.6 shows that ChatGPT is good at mimicking syntactic structures from sentences instead of directly incorporating the syntactic parses. Besides ChatGPT, we also tried Alpaca [146] and LLaMA [148] on the controlled paraphrase generation task. However, they repeat 22 input sentences and struggle to generate meaningful content. Therefore, we do not include them here for comparison. 2.1.7 Conclusion We test the controllability of large language models on five tasks and ten benchmarks, including a numerical planning benchmark that is easy for humans while challenging for LLMs. From there, we draw a spectrum by comparing the performance between LLMs and smaller specialized models. LLMs are able to generate human-level rationales and conform with coarse control signals, such as sentiment, topic and keyword incorporation. However, they struggle at fine-grained hard constraints, such as numerical planning and paraphrase generations. We hope that our work can inspire downstream applications on when to adopt LLMs. For example, we find that LLMs are good at generating rationales, and these automatic rationales could be used to further boost LLMs’ performance through chain-of-thought reasoning. 2.2 Dialect-Robust Evaluation of Generated Text 3 2.2.1 Motivation and Contribution Most natural language generation (NLG) evaluation metrics compare a system output against a humanwritten reference. References are usually drawn from a relatively narrow range of linguistic styles. They often exclude varieties like Indian English or Iberian Portuguese, which are geographical dialects with millions of speakers. 
As a result, outputs in dialects that are not represented in the reference may score poorly, discouraging the development of systems to meet the needs of these language communities. Although contemporary metrics such as COMET [125] can be reference-free, they still rely on training data and rater pools that do not cover all dialects of interest, leaving many dialects effectively out-of-domain. The performance of evaluation metrics on these out-of-domain dialects has not been quantified. 3 Please refer to the published work [62].

We define a dialect-robust evaluation metric as one that produces the same score for system outputs that share the same semantics but are expressed in different dialects. To understand whether current evaluation metrics are dialect-robust, we propose to quantify dialect robustness at the dialect-feature level and at the sentence level. The analyses measure the dialect sensitivity of evaluation metrics by comparing semantics-preserving dialect edits to perturbations that change the meaning of sentences. Through our analyses, we demonstrate that multiple state-of-the-art NLG evaluation metrics are not robust to dialects of Mandarin, English, and Portuguese. In many cases, system outputs that are perturbed so as to differ semantically from the reference score higher than outputs in which the only change is to the dialect. With the goal of increasing dialect robustness without degrading performance on standard benchmarks, we propose a training schema, NANO. NANO is an unsupervised pretraining step that distills dialect information from the multilingual pretraining dataset into the metric model, which we demonstrate leads to improved dialect robustness. Based on our findings, we lay out research goals toward dialect-inclusive metrics. Moving beyond dialect robustness, we formalize the goal of dialect awareness, in which metrics can be applied to any user-specified language and dialect regardless of the language of the reference or source document.

2.2.2 Definition of Dialect Robustness

Dialects can be regarded as linguistic subdivisions that align with communities of speakers, often grouped by geographical or demographic attributes [17]. A classic example is nation-level varieties, such as Brazilian and Iberian Portuguese. Dialects are distinguished from each other by a set of dialect features, which can operate at the levels of pronunciation, lexicon, rhetorical devices, and grammar [155]; one working definition of a dialect is as a set of correlated features [99]. Two examples of dialect features are shown in Figure 2.3. The left side shows the English dialect feature "focus only", which distinguishes Indian English from other varieties, such as US English [75]. The
If all the training data for the metric used to evaluate generation quality comes from Brazilian Portuguese, it will likely assign a lower score to Iberian Portuguese outputs, thereby misrepresenting system quality and disincentivizing further development of the more diverse system in favor of one that only produces Brazilian Portuguese. To be able to measure this effect, we define dialect robustness in the context of NLG evaluation as: Definition 1 (Dialect robustness). Let y (d) and y (d ′ ) be two system outputs that are semantically equivalent but written in different dialects. An evaluation metric m : Y → R is dialect robust iff m(y (d) ) = m(y (d ′ ) ) for all such (y (d) , y(d ′ ) ). 5 This definition is strict: it would not apply to any system that produced even small differences in score between semantically equivalent, regionally distinct outputs. For that reason, we propose a relaxed criterion, which compares the change in the metric induced by dialect to changes induced by semantic perturbations: 4 https://ewave-atlas.org/parameters/62#2/7.0/7.9 5 For simplicity we do not include the reference in this definition. A corpus-level reference-based metric could be defined as 1 N P i mi(yi) with mi(yi) = δ(yi, ri), with ri indicating the reference for example i and δ : Y × Y → R. Similarly, a corpus-level quality estimation metric could be defined with mi(yi) = δ(yi, xi) with xi indicating the input, such as the source language or passage to be summarized. For the corpus-level metric to be dialect robust (or ϕ-robust), all mi must be dialect robust (or ϕ-robust). 25 Figure 2.3: An illustration of dialect robustness in the context of generation evaluation. We define dialect robustness as evaluation metrics that are expected to have the same output across dialects that share the same semantics. Dialect edits (highlighted in yellow) should not lead to a greater degradation of score than edits that change the underlying semantics (highlighted in underline). BLEURT-20 in the figure assigns higher score to semantically-perturbed sentences than sentences with dialect features, exposing its vulnerability to dialects. Definition 2 (ϕ-Dialect robustness). Let y (d) and y (d ′ ) be two semantically-equivalent system outputs that differ in dialect. Let ϕ : Y → Y∗ be a semantic perturbation function that maps an input to a set of outputs whose semantics are different from the input. An evaluation metric m : Y → R is ϕ-dialect robust if m(y (d) , y(d ′ ) ) > m(y (d) , y˜) for all semantically-equivalent (y (d) , y(d ′ ) ) and all y˜ ∈ ϕ(y (d) ). Figure 2.3 illustrates the concepts of dialect robustness and dialect awareness. The top two rows of each panel vary only by dialect; the bottom row shows semantic perturbations of the top row. ϕ-dialect robustness implies that the top row is scored as more similar to the middle row than to the bottom row. Dialect awareness implies that the quality of the surface form in each row should be highest when paired with the correct dialect label. Is Semantic Equivalence Realistic? The above definitions presume that it is possible to characterize utterances in different dialects as semantically equivalent. Such characterizations have been criticized as lacking a strong foundation for semantic equivalence, outside the limited case in which the dialect differences are purely phonological [77, 127]. One such criticism is that a pair of utterances might be semantically equivalent for some communicative purposes, but not for others. 
To avoid the gray area between dialect differences that change semantics and those that do not, we design perturbations that have a small surface-level impact on the original utterance but a strong effect on its meaning, e.g. by 26 negating the main proposition or changing an important semantic argument. This establishes a necessary but not sufficient condition for dialect robustness: if a metric scores such perturbations more highly than dialect pairs, then it is certainly not dialect robust. Proving that a metric is dialect robust is more difficult, because it requires constructing more subtle semantic perturbations that are harder to distinguish (even conceptually) from dialect variables. Furthermore, from a practical standpoint we cannot evaluate y (d) with respect to all semantic perturbations y˜ ∈ ϕ(y (d) ), but the existence of perturbations for which m(y (d) , y˜) > m(y (d) , y(d ′ ) ) is enough to disprove dialect robustness. 2.2.3 Existing Metrics To assess the quality of a generated text, most automatic evaluation approaches compare it to a “ground truth” reference, with higher similarity to the reference implying higher-quality output [16]. Similarity can be based on lexical features or distributed representations. When distributed representations are used, they may be unsupervised [168] or fine-tuned on a corpus of human ratings. In addition to these similaritybased metrics, there are also reference-free metrics for quality estimation (e.g., COMET-QE), which we discuss in §2.2.5.2. Lexical Evaluation Metrics Many evaluation metrics including BLEU [102] and chrF [109] use lexical features such as n-gram overlap to measure similarity and remain popular evaluation metrics because of their lightweight and fast computation. However, due to their reliance on surface forms, BLEU and chrF have limited robustness to superficial syntactic differences between system outputs and references [133]. As dialects inherently include lexical variables, traditional evaluation metrics based on lexical overlap are expected to not perform well in terms of dialect robustness. 27 Distributed Evaluation Metrics Moving beyond surface forms, recent advances such as BLEURT [110],6 and COMET leverage the representations from models that are trained on human ratings. BLEURT pretrains RemBERT [23] on augmented data from Wikipedia and then finetunes on human ratings from WMT corpora. COMET is trained on the mixture of WMT and another two corpora, QT21 [QT21] and MQM [36] which both rely on machine translated outputs. Prism is trained on generated paraphrases from a mixture of data resources in 39 languages and does not require human ratings during training. YiSi directly utilizes the multilingual representation from multilingual BERT [28] for scoring. In summary, existing learned metrics either utilize the multilingual representation from pretrained models, or create multilingual training data through various augmentation strategies. However, none of them explicitly accounts for dialectal variations during training. 2.2.4 Testing Dialect Robustness In this section, we describe our methodology for assessing dialect robustness. We first introduce two ways to perturb sentences to get two comparable metrics’ outputs and then describe the statistical tests we use to aggregate the outputs over a corpus. 
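Before detailing the two kinds of rewrites, the underlying comparison can be sketched schematically: for every example we score a semantics-preserving dialect rewrite and a meaning-changing perturbation against the same base text, and a dialect-robust metric should prefer the former. The sketch below is illustrative rather than the implementation used in our experiments; m stands for an arbitrary candidate/reference metric.

from typing import Callable, List, Tuple

Metric = Callable[[str, str], float]   # m(candidate, reference) -> score

def score_conditions(examples: List[Tuple[str, str, str]],
                     m: Metric) -> List[Tuple[float, float]]:
    # examples: (base, dialect_rewrite, semantic_perturbation) triples.
    scores = []
    for base, dialect, perturbed in examples:
        sigma_dialect = m(dialect, base)    # same meaning, different dialect
        sigma_perturb = m(perturbed, base)  # small edit, different meaning
        scores.append((sigma_dialect, sigma_perturb))
    return scores

def win_rate(scores: List[Tuple[float, float]]) -> float:
    # How often the dialect rewrite outscores the semantic perturbation.
    return sum(d > p for d, p in scores) / len(scores)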
2.2.4.1 Micro-level Dialect Features

Dialect features are local edits that distinguish dialects while avoiding changes to the meaning of the text; an orthographic example is the spelling of the suffix "-or" vs. "-our", which distinguishes U.S. vs. U.K. English. Our first robustness assessment uses such features. We start with a base sentence y_i^(base) taken from a corpus of sentences D = {y_1, . . . , y_n}. We further assume access to a version of the same sentence in which a dialect feature was introduced, denoted y_i^(dialect). Following Definition 2, we introduce a semantic perturbation that changes y_i^(base) to y_i^(perturb). Again using English as an example, from the US English base sentence "as recently as April. . .", we may produce the Indian English version "recently only in April. . ." (using the feature focus-only), and the semantic perturbation "as recently as May. . .". 6We use the BLEURT-20 checkpoint from [110], which differs from the original BLEURT [133].

Let m(y_i, y_j) be a metric function that takes a candidate sentence y_i and a reference y_j as input and produces a score. Given the above-defined variations of y_i, we define the dialect and perturbed scores as

σ^dialect_{m,i} = m(y_i^(dialect), y_i^(base))   (2.1)
σ^perturb_{m,i} = m(y_i^(perturb), y_i^(base)).  (2.2)

To satisfy Definition 2, σ^dialect_{m,i} should score higher than σ^perturb_{m,i} across the sentences in the corpus. This implies, as a necessary but not sufficient condition, that E_{i∼D}[σ^dialect_{m,i}] > E_{i∼D}[σ^perturb_{m,i}].

We consider three perturbation types: deletion, replacement and insertion. Each perturbation aims to change the sentence by only a single word or phrase, so as to induce a strong semantic change with a minimal impact on the surface form. Such perturbations are expected to yield challenging but clear comparisons against dialect variation. There are no standard techniques for introducing semantic perturbations, so we apply few-shot learning by prompting LaMDA [26]. For each perturbation type, we provide five exemplars and then prompt LaMDA for automatic semantic perturbation given a sentence y_i^(en-base). Some sentences are not amenable to all perturbations — for example, some are too short to support deletion — so we choose one perturbation per sentence, with the preference order of replacement, insertion and then deletion, determined by the success rate of having a different sentence as output.

2.2.4.2 Sentence-level Dialect Rewrites

Micro-level dialect features require significant linguistic expertise to identify and have been defined for only a few languages. We thus introduce a less granular method that is based on parallel human translations. Given an English base sentence en_i, we obtain human translations y_i^(j) and y_i^(k) in dialects j and k of the target language, e.g., Brazilian and Iberian Portuguese. We can again use the metric m to score the pair

σ^dialect_{m,i} = m(y_i^(j), y_i^(k)).   (2.3)

Because we have access to the English base sentence, we can use machine translation to generate a sentence in the target language, en_i ⟹_MT ŷ_i^(j*), which we can compare to, such that

σ^MT_{m,i} = m(y_i^(j), ŷ_i^(j*)).   (2.4)

Here, j* indicates the locale that we believe is most strongly targeted by the machine translation system ("pt-BR" for Portuguese, "zh-CN" for Mandarin). Finally, we construct target language perturbations by first perturbing the English source and then automatically translating:7

en_i ⟹_perturbation ẽn_i ⟹_MT ỹ_i^(j*)   (2.5)
σ^perturb_{m,i} = m(y_i^(j), ỹ_i^(j*)).   (2.6)
The perturbations are produced by prompting LaMDA with the same exemplars as in §2.2.4.1. 7While it is possible to directly perturb the sentences in the target language, using the same English-validated few-shot setup scales to more languages at the cost of a more English-centric perturbation style.

We expect E[σ^MT_{m,i}] > E[σ^perturb_{m,i}], because both involve machine translation, but the latter also involves a perturbation to the source. If we have E[σ^perturb_{m,i}] > E[σ^dialect_{m,i}], then metric m strongly disprefers dialect variants, even in favor of inputs that are different in meaning due to the perturbation of the English source.

2.2.4.3 Statistical Methods

As a necessary condition for dialect robustness, we test whether the expected scores for dialect rewrites exceed the expected scores for semantic perturbations. A challenge in correctly characterizing the uncertainty of these comparisons is that there is a substantial amount of variance over the original examples y_i^(base). We handle this with two styles of analysis:

Mixed-effect Regression For metric m, example i, and condition j ∈ {perturb, dialect, MT}, we model the metric σ^(j)_{m,i} via a mixed-effects regression [7, 137],

σ^(j)_i = θ_i + ϕ_j + ϵ_{i,j},   (2.7)

with the subscript m implicit in each term. The first term θ_i is a random intercept associated with example i, which helps to address the variance across examples; ϕ_j, the parameter of interest, is a fixed effect associated with the condition j; ϵ_{i,j} is a Gaussian error. Because all methods and conditions are applied to all examples, the predictors are uncorrelated. This makes it possible to interpret ϕ_{m,j} as an estimate of the expected change in the metric value corresponding to the application of metric m in condition j. By including the θ_i term, the regression is conceptually equivalent to a pairwise comparison, in the sense that the regression also benefits from the additional power obtained by controlling for per-example variation.

Win/loss Analysis and Binomial Test For a coarse-grained evaluation that is more easily comparable across metrics, we count how often each condition j receives a higher score than condition k in a pairwise comparison. When j represents dialect rewrites and k represents semantic perturbations, a high win rate indicates that the metric is more likely to be dialect robust. To measure statistical significance, we apply a one-tailed binomial test, which computes the likelihood of achieving at least n wins in T trials given a null-hypothesis win probability of 1/2. In words, we test against the null hypothesis that for each example, a dialect rewrite and a semantic perturbation are equally likely to get the higher score. As discussed in the next section, we perform multiple comparisons per metric, across different conditions and different languages. To adjust the p-values for multiple comparisons, we apply the Bonferroni correction [29].

2.2.5 NANO

We hypothesize that explicitly encoding dialect information while pretraining a model will lead to improved downstream robustness. To test this hypothesis on learned metrics for text generation, we introduce NANO,8 a model-agnostic pretraining schema with the goal of improving dialect robustness without performance degradation on downstream metric benchmarks.

2.2.5.1 Acceptability Pretraining

Given a pretrained model, we add a second pretraining phase to distill dialect information into the model.
Specifically, we define the NANO-task as, given an expression yd in dialect d which is part of a language L, identify whether yd is acceptable in a given dialect d ′ or language L ′ . Data To construct a training corpus for NANO, we process mC4 [161]. We split the corpus into sentences and use a Language Identification (LangID) model [169] by [9] to identify the language and locale information for the sentences.9 Besides LangID output, mC4 provides the URL where a sentence originated from which we extract the region information as an indication of geographic dialect. For Portuguese and 8The name is motivated by the dialect feature “invariant tag (‘isn’t it’, ‘no’, ‘na’)” [75]. 9We use a more current model that is capable of identifying the locale for Mandarin and Portuguese. 32 Mandarin, we filter an instance if the predicted locale does not agree with the region information from the URL. For other languages, we combine the LangID and region information as a noisy approximation for a dialect of the language in the specific region. For example, if the LangID model predicts that the language is English and the region in the URL indicates India (.in), we treat the instance as en-IN.10 We compare three pretraining settings with an increasing noise: 1) Mandarin and Portuguese only; 2) Mandarin, Portuguese and selected English dialects and 3) ten languages with metric finetuning data evaluated during the WMT benchmark with ninety-five language variants following the classification by [30]. Given a sentence, we balance the ratio of sampling a dialect or language tag using a parameter λ. For instance, a sentence with gold dialect tag “pt-BR” can be a positive instance for the dialect itself or the general language “pt-any”. At the same time, it can also be a negative instance for other dialect (e.g., “en-IN”) or language (“en-any”). The ratio of positive instances versus negatives instances is always 0.5. Modeling We use mT5 [161] as our base model because the model is pretrained on the mC4 dataset, matching with our corpus choice and ensuring tokenizer compatibility. During pretraining, we transform each sentence into the string candidate: {sentence} language: {language_tag}, where the language_tag can be the dialect or the general language tag. The target label is zero or one, indicating whether the sentence belongs to the language tag. We adapt the Encoder-Decoder architecture of mT5 for regression by taking the logits of the first decoded token and applying the RMSE loss function between the logits and the label during model training. 2.2.5.2 Finetuning Following the setup by [110], we use the test data from the WMT shared task from 2015 to 2019 as training data and use the WMT shared task 2020 as test data. Among previous works, BLEURT [110] and YiSi [89] are trained to measure the semantic similarity between candidate and reference within the same language. 10This is a noisy approximation because many dialects do not align with national borders. The development of a data-gathering approach for subnational and transnational dialects is an important topic for future work. 33 EN PT ZH All 148 2616 2227 Replace 96 962 866 Insert 89 550 528 Delete 63 693 614 Agg. 115 1415 1252 Table 2.8: Number of evaluation examples per language before and after semantic perturbation. The middle three rows are the number of examples to which each perturbation was applicable, and the final row Agg. is the number of examples to which at least one perturbation is applicable, which we use in our final analysis. 
COMET, on the other hand, supports the cross-language quality estimation of a machine translation with or without reference, but does not support within-language assessment.

2.2.6 Experiments

In this section, we demonstrate that existing metrics are not dialect robust by applying our proposed methods and statistical tests to existing corpora in English, Portuguese, and Mandarin. We show that language-aware pretraining via NANO improves the dialect robustness and leads to promising preliminary steps toward dialect-aware metrics. Furthermore, we present evidence that language-aware pretraining can improve the metric performance on the WMT benchmark and that the method successfully transfers to other evaluation setups like quality estimation.

Datasets As described in §2.2.4, we consider micro-level and sentence-level dialect rewrites. The micro-level rewrites are based on pairwise data from [27], in which each example includes a form containing at least one dialect feature from Indian English and a corresponding "base sentence" in U.S. English. We then apply the semantic perturbation to the base sentence as described in §2.2.4.1. For each perturbation type, one of the coauthors manually examined whether the perturbation successfully changes the meaning of the sentence. If all of the three perturbations fail, we exclude the instance from analysis.11 11For the sentences that have multiple dialect rewritings, we treat each one as an individual data point. When multiple semantic perturbations can be applied, we choose a single one, preferring replacements, then insertions, and then deletions.

Figure 2.4: Coefficients from the regression model for Dialect vs. Semantic Perturbation (ϕdialect vs. perturb) and MT vs. Semantic Perturbation (ϕMT vs. perturb), with one panel per metric (NANO-XXL, BLEURT, Prism, YiSi, BLEU, chrF) and one coefficient per language (en, pt, zh). The higher ϕdialect vs. perturb is, the more dialect-robust a metric is. Error bars show 99% confidence intervals; they are larger for the English evaluations because there is less data. ϕMT vs. perturb serves as a stress test to measure evaluation metrics' abilities to recognize semantic changes. We show that evaluation metrics are good at recognizing semantic changes but not dialect changes. For all metrics except BLEURT and NANO, ϕdialect − ϕperturb is negative for at least one language, exposing their vulnerability to dialects.

        Learned                 Lexical        mT5base         mT5XL          mT5XXL
     BLEURT  Prism  YiSi     BLEU   chrF    -NANO  +NANO    -NANO  +NANO   -NANO  +NANO
EN    0.53   0.51   0.53     0.49   0.46     0.50   0.50     0.55   0.54    0.57   0.57
PT    0.59   0.53   0.36     0.35   0.35     0.39   0.44     0.57   0.65    0.82   0.81
ZH    0.59   0.47   0.46     0.35   0.36     0.46   0.45     0.51   0.59    0.74   0.74

Table 2.9: Success rates of σ^dialect_{m,i} > σ^perturb_{m,i}. Training with NANO starts to improve upon the strongest baseline BLEURT with mT5XL and achieves the best performance with mT5XXL. We boldface the success rates that are better than random chance (0.5) and significant after applying Bonferroni correction for multiple comparisons. Training with NANO improves dialect robustness for the XL- and base-scale models.
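The significance testing behind Table 2.9 follows the win/loss analysis and Bonferroni correction from §2.2.4.3. A minimal sketch, assuming scipy is available; the win counts, trial counts, and number of comparisons in the usage example are illustrative placeholders, not our exact numbers.

from scipy.stats import binomtest

def dialect_win_test(wins: int, trials: int, num_comparisons: int,
                     alpha: float = 0.05):
    # One-tailed binomial test of the dialect-vs-perturbation win rate
    # against a null win probability of 0.5, with Bonferroni correction.
    result = binomtest(wins, trials, p=0.5, alternative="greater")
    corrected_alpha = alpha / num_comparisons
    return {
        "success_rate": wins / trials,
        "p_value": result.pvalue,
        "significant": result.pvalue < corrected_alpha,
    }

# Example: a metric preferring the dialect rewrite on 740 of 1252 pairs,
# with 18 metric/language comparisons corrected for jointly (hypothetical):
# dialect_win_test(wins=740, trials=1252, num_comparisons=18)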
35 EN PT ZH mT5base -NANO 0.010.01 -0.020.00 -0.020.00 +NANO 0.040.01 -0.01 0.00 0.000.00 mT5XL -NANO 0.010.01 0.020.00 0.020.00 +NANO 0.060.01 0.050.00 0.050.00 mT5XXL -NANO 0.150.02 0.120.00 0.110.00 +NANO 0.190.02 0.150.00 0.130.00 Table 2.10: Coefficients from the regression model for Dialect vs. Semantic Perturbation, indicating the dialect robustness, before and after using NANO. We boldface significant coefficients where NANO helps. We show that training with NANO improves the dialect robustness across all model sizes and languages. For sentence-level dialect analysis, we use the test set of the FRMT benchmark [126]. Each instance contains an English sentence and its translations into dialects of the target languages Portuguese and Mandarin. For Portuguese, the two dialects are Brazilian Portuguese (pt-BR) and European Portuguese (pt-PT); for Mandarin, we consider mainland Mandarin and Taiwanese Mandarin, both in simplified script. As described in §2.2.4.2, semantic perturbations are obtained by perturbing the English sentences and then translating, using the Google Translate API. Table 2.8 shows the number of evaluation examples. 2.2.7 Sensitivity to Dialects We use the statistical methods reported in §2.2.4.3 to test metrics’ sensitivity to dialects. Regression Following Equation 2.7, we use m,i, m,i, m,i as conditions and model each metric as a mixedeffects regression. For a dialect-robust metric, we expect ϕdialect > ϕperturb, indicating that dialect rewrites score more highly than semantic perturbations, as required by definition 2. The difference ϕdialect−ϕperturb is shown in the Y -axis of Figure 2.4. We also evaluate ϕMT − ϕperturb as a stress test to measure metrics’ abilities to recognize semantic changes, and to ensure that the semantic perturbations are effective. For all metrics except BLEURT and NANO, ϕdialect − ϕperturb is negative for at least one language, indicating that these metrics are not dialect robust even in the average case. At the same time, all evaluation metrics can distinguish the mt and perturb conditions, showing that the issue is specific to dialect and not generally 36 applicable to other paraphrases. Table 2.10 shows the coefficients before and after using NANO, which improves dialect robustness across all model sizes and languages. Success Rates In Table 2.9 we report the success rates of a metric in assigning higher scores to dialect rewrites than to semantic perturbations. BLEURT performs better than other existing evaluation metrics which consistently fail to rank the dialect change above the perturbations. However, no metric correctly ranks the English examples at better than a random chance win rate (0.5), and even BLEURT as the most robust metric only has a 0.59 win rate for PT and ZH. In comparison with BLEURT, NANO achieves a higher win rate when scaled to XL and XXL. The same trend can be observed in the regression analysis, where NANO’s coefficients are positive for all metrics and languages. However, the marginal benefit of NANO over simply finetuning a larger model diminishes at scale—while NANO leads to significant improvements at XL scale, it has only a minor effect on the XXL-sized model. 2.2.8 Conclusion We introduce and formalize the dialect robustness and dialect awareness in the context of generation evaluation. Grounded by a suite of statistical tests, we find that existing evaluation methods are not robust to dialects. As a first step toward a solution to this problem, we propose NANO as a pretraining strategy. 
Our experiments demonstrate that NANO offers a size-efficient way to improve both the dialect robustness and improves the metric performance of metrics on WMT benchmark. Due to the limited availability of dialect-parallel corpora, our robustness tests are conducted in thousands of examples for Mandarin and Portuguese and hundreds of examples for English, which is insufficient to capture the full extent of these languages. We encourage future work to develop more resources, including benchmarks and corpora to conduct research on dialects for NLG evaluation. Our encouraging preliminary results lead us to urge researchers to consider and improve the dialect diversity during pretraining. 37 Chapter 3 Correlation Between Model Performance and Data Quality As other deep learning methods, large language models are also black boxes. More recently, companies such as OpenAI, the birthplace of ChatGPT, started the trend of serving large language models as web services. These closed-source model developing practices make it impossible for academic researchers to look into some failure modes from the model architecture side. In our study on investigating the benefits of free-form rationales for rather smaller models, we showcase that better data quality not only leads to better human interpretability, but also better model utility (Section 3.1). On the other hand, if the training data such as Wikipedia contains gender bias, such biases may propagate to large language models on downstream tasks (Section 3.2). 3.1 Investigating the Benefits of Free-Form Rationales 1 3.1.1 Motivation and Contribution Free-form rationales designed to explain decisions by providing additional world knowledge or commonsense reasoning, are key for interpretability [70, 85, 5] in natural language processing tasks.2 Free-form rationales come with the promise of being easily interpretable by humans, in contrast to other kinds of explanations, such as extractive rationales in the form of textual highlights [14, 79], or low-level neuron 1 Please refer to the published work [59]. 2We use the terms “rationale” and “explanation” interchangeably. Please see [156] and [54] for more details on terminology. 38 Question Options Where would you find a monkey in the wild? zoo, barrel, research laboratory, captivity, thailand thailand find a monkey in the wild generated ECQA thailand - Wikipedia Thailand has a lot of wild areas with monkeys. All the other options are incorrect as they are not wild areas. All the other options are incorrect as they are not a wild place. In thailand, monkeys can be found in wild. crowd sourced Vs. Vs. crowdsourced generated Cos-E leaks answer? has background knowledge? has background knowledge? leaks answer? Figure 3.1: Illustration of our investigation into free-form rationales for commonsense QA from CoS-E [150] and ECQA [2]. We conduct human studies to understand perceived usefulness of rationales, by asking if they contain background knowledge necessary to answer a question (yellow highlights). We also investigate if rationales leak the answer to models that use them as additional training signals. Our work compare rationales from different sources, and finds that ECQA rationales are preferable to CoS-E rationales on various axes. Finally, we find that crowdsourced rationales also offer greater benefits to both humans and models than generated rationales. activations in neural architectures [46]. 
Indeed, there have been increasing efforts to collect corpora containing free-form rationales for task instances, which provide a supervised setting for teaching models to produce rationales for test-time decisions. Such corpora include CoS-E [118] and ECQA [2] for commonsense question-answering, e-SNLI [14] for natural language inference, SBIC [132] for social bias inference, among others. However, the benefits of rationales remain unclear. Do crowdsourced rationales really help human users interpret decisions better, or do they simply provide the right answer without the necessary background knowledge or reasoning? Our work explores this question through two carefully designed human studies. We find that rationales from different corpora have different capabilities: humans find 93% of ECQA rationales provide additional information that can help answer questions, while only 12% of CoS-E rationales do. 39 Inspired by this finding, we further ask: analogous to the benefit to human users, can crowdsourced rationales also benefit models by providing an additional training signal to boost performance? In contrast to prior work that uses rationales as supervision to generate model rationales, we focus on using crowdsourced rationales to simply aid a task model’s classification capabilities. Our results indicate that while crowdsourced rationales do indeed boost model performance, they might be doing so trivially, i.e. by simply leaking the correct answer to the model. In response, we experiment with different strategies for altering ECQA and CoS-E rationales to prevent such leakage, and set up a fair test benchmark. We find that, even without leakage, rationales with background knowledge are helpful: including only 5% of high-quality rationales during training can improve model performance by 47.22% at inference time. Meanwhile, rationales that are perceived higher quality by humans would bring a better gain for models too. 3.1.2 Preliminaries Tasks and Datasets. We explore three large datasets containing crowdsourced free-form natural language rationales. The first two, CoS-E [118] and ECQA [2], address commonsense-based question answering (ComQA). The ComQA task is based on answering questions about common situations, from a choice of 3 (CoS-E v1.0) or 5 (CoS-E v1.11) answers, along with providing a free-text explanation for the correct answer.3 ECQA builds upon and improves the quality of CoS-E v1.11 explanations, in terms of comprehensiveness, refutation completeness and non-redundancy [2]. In addition, ECQA explanations are contrastive, i.e. they include rationales for choosing the correct option and rejecting other options. We additionally consider an open-domain reasoning task about textual qualitative relationships, via the QuaRTz [144] dataset, for a subset of our experiments. In this task, each instance contains a triplet: a situated qualitative question, two answer options and a knowledge statement that can help answer the 3CoS-E does not provide explanations for instances in the test set; we report our results on its validation set. 40 Source Rationale CoS-E v1.11 People waiting alongside with when you’re in a reception area ECQA People waits in a reception area. You cant wait along with a motel, hotel, chair or a hospital. These are the people where the reception area is found but people waits together at reception area of such places. ECQAshuffle You cant wait along with a motel, hotel, chair or a hospital. 
These are the people where the reception area is found but people waits together at reception area of such places. People waits in a reception area. Table 3.1: Example annotations from CoS-E v1.11 and ECQA for the question “What are you waiting alongside with when you’re in a reception area?” with options 1: motel 2: chair 3: hospital 4: people 5: hotels and the correct option people. CoS-E annotation directly combines the question and the correct answer, while ECQA annotation provides additional background knowledge. question. For example, for “Compared to a box of bricks a box of feathers would be (A) lighter (B) heavier”, the annotated knowledge in QuaRTz is A given volume of a denser substance is heavier than the same volume of a less dense substance. In contrast to CoS-E and ECQA, the two options for a question in QuaRTz are orthogonal, which means the knowledge provided to support one option will automatically reject the other option. Furthermore, this general qualitative knowledge statement in QuaRTz is guaranteed to not leak the correct answer. While not explicitly designed for interpretability, we treat the annotated knowledge in QuaRTz as a rationale that can help understand or derive the correct answer. 3.1.3 Do crowdsourced rationales aid human interpretability? Free-text rationales purportedly improve human user interpretability by explaining a model’s decisions in natural language. We seek to discover which characteristics of the rationales aid users: Q1 Do rationales provide additional background knowledge for understanding decisions? E.g., the rationale: ‘Air cannot stay in any object that has a hole in it’ provides additional knowledge for understanding why the answer to ‘What would not be true about a basketball if it had a hole in it but it did not lose its general shape?’ should be ‘full of air’. 41 Q2 Do rationales provide explicit clues to leak the correct answer? For ComQA, this might initially seem like a helpful rationale, without really being so.4 E.g., given a rationale: ‘Mexico is one of the largest coffee production country.’, one can guess the correct answer should be ‘mexico’, when given the options ‘mildred’s coffee shop’, ‘mexico’, ‘diner’, kitchen’ or ‘canteen’, without looking at the question ‘In what Spanish speaking North American country can you get a great cup of coffee?’. 3.1.4 Preliminary Studies We investigate Q1 and Q2 via a direct assessment (§3.1.4.1) by human raters, as well as via proxy questions offering an indirect assessment (§3.1.4.2) by the raters. 3.1.4.1 Direct Assessment We conduct a pilot study where given the question, options, correct answer and rationales from CoS-E and ECQA for a ComQA instance, annotators are tasked to directly answer which rationale provides additional background knowledge that can help them answer the question. Four options are possible: CoS-E, ECQA, neither, or both. 5 Simultaneously, we ask annotators if any of the two rationales leaks the correct answer. Concretely, the annotators are required to provide three annotations for each instance: • choose one option for the additional background information (T1); • judge if ECQA rationale leaks the correct answer (T2); • judge if CoS-E leaks the correct answer (T3). We conduct our study on the first 120 rationales in ECQA and CoS-E v1.11 test set via the Amazon Mechanical Turk platform. For each instance, we collect annotations from three independent annotators. 
4While leakage does not reduce the utility of a rationale for human interpretability, it does have implications for utility as model supervision, as we will see in subsequent sections, §4.2.4. 5While [2] provide similar human studies comparing ECQA and CoS-E rationales, they do not specifically ask for additional background knowledge. 42 RECQA RCoS-E both neither Q1: has bg. knowl.? 65.0% 9.2% 20.8% 5.0% Q2: leaks answer? 83.3% 43.3% n/a n/a Table 3.2: Human study directly comparing ECQA and CoS-E rationales on 120 ComQA instances, for the presence of background knowledge, and answer leakage. question options Rcrowd Rconstructed CoS-E Where can a human find clothes that aren’t pants? pants shop, on planet earth, dress shop, school, train wreck dress shop can a human find clothes that aren’t pants. A human can find clothes at dress shop that aren’t pants. ECQA Where do adults use glue sticks? classroom, desk drawer, at school, office, kitchendrawer Glue stick is a solid glue used to stick thin paper materials by adults in offices. Adults don’t go to classroom and school, and other options don’t have adults. Adults use glue sticks in their offices. They do not use them at classroom, desk drawer, at school or kitchen drawer. Table 3.3: Examples of crowdsourced rationales for CoS-E and ECQA, vs. our manually constructed rationales that declaratively combine the question and the answer without providing any background knowledge or commonsense reasoning. Using Fleiss’s Kappa [35], the inter annotator agreement (IAA) for T1, T2 and T3 are 0.43, 0.26, and 0.30, respectively, indicating moderate agreement. We take the majority vote as the final label. Table 3.2 shows the results of our human evaluation. We see 85.8% of ECQA rationales provide additional background knowledge to help answer the question, while only 30.0% of CoS-E rationales do the same, indicating greater usefulness of ECQA rationales for human interpretability. Both ECQA and CoS-E rationales leak the correct answers. Indeed, most ECQA rationales provide some background knowledge necessary for humans to understand the decision, while also revealing the correct answer; the same does not hold for CoS-E. 3.1.4.2 Indirect Assessment While the previous study asked participants to directly assess the background knowledge of individual rationales, we design two other studies below that use a proxy to extract a human assessment of rationale utility [145], for Q2 and Q1, respectively. Here, we randomly sample 100 ComQA instances from the test set. 43 Rcrowd Rconstructed neither either CoS-E 3.0% 5.0% 92.0% 0.0% ECQA 73.0% 9.0% 14.0% 4.0% Table 3.4: Results from our human study via indirect assessment to compare 100 pairs of crowdsourced and constructed rationales. The IAA is 0.61. For Q2, we ask annotators to guess the correct answer from all options, given only the crowdsourced rationales from CoS-E and ECQA; annotators can also opt for “cannot tell” based on the evidence. We hypothesise that this study will indirectly answer whether the rationale leaks the correct option, if the worker is able to guess correctly. Each instance is provided to three annotators, and we take a majority vote for their ratings. We find that annotators are able to pick the correct answer, given only the rationales (and not questions) in 43.0% of cases for CoS-E and 78.0% of cases for ECQA, with high agreement (IAA 0.73). This confirms our findings from the direct assessment in Table 3.2. 
For Q1, we manually construct rationales to contrast with crowdsourced rationales. Our constructed rationales are designed to simply combine the question and the correct answer, but not provide any additional background knowledge. If a human prefers the crowdsourced rationale, we can indirectly ascertain that it provides some background knowledge to help with human interpretability. For CoS-E, we form a constructed rationale for a question by rephrasing the question as a statement and inserting the correct option in place of the question word. For ECQA, in addition to the CoS-E-style constructed sentence, we add an additional sentence that rephrases the question as a negative statement, replaces some referents with pronoun anaphora, and inserts the incorrect options in place of the question word. We also try to ensure fluency and stylistic consistency with the crowdsourced explanations. We show two examples of our constructed rationales in Table 3.3. We provide human subjects with the question, the correct answer, the crowdsourced rationale (from CoS-E or ECQA) and our constructed rationale. We instruct workers to choose the explanation that they would prefer if they need to explain the correct answer to someone who might not have the necessary background knowledge to understand 44 Category Description Example Distribution Rno-leak-bg provides additional background knowledge without leaking correct answers. Question: What would not be true about a basketball if it had a hole in it but it did not lose its general shape? Options: 1: punctured 2: popular in america 3: full of air 4: gone 5: round Rationale: Air cannot stay in any object that has a hole in it. 4.83% (59/1221) Rleak-bg leaks the correct answer but contains additional background knowledge that can help answer questions. Question: In what Spanish speaking North American country can you get a great cup of coffee? Options: 1: mildred’s coffee shop 2: mexico 3: diner 4: kitchen 5: canteen Rationale: Mexico is one of the largest coffee production country. 6.72% (82/1221) Rno-leak-no-bg neither provides any additional background information, nor leaks the correct answer. Question: why would a person like to have a large house? Options: 1: have choice 2: mentally challenged 3: own house 4: obesity 5: lots of space Rationale: This word is most relevant 43.65% (533/1221) Rleak-no-bg leaks the correct answer and does not provide additional background knowledge. Question: where will a cheap book be found? Options: 1: bookstore 2: classroom 3: discount store 4: school room 5: bedside table Rationale: discount shop retail shop 44.80% (547/1221) Table 3.5: Our manual four-way categorization of CoS-E v1.11 (dev.) rationales, with examples. Bolded options indicate ground truth. We find that 88.45% of rationales do not provide additional background knowledge. given only the question and set of answer choices. Each instance is provided to three annotators, and we take a majority vote for their ratings. Results in Table 3.4 show that human raters overwhelmingly preferred neither our constructed rationales or CoS-E rationales, indicating that neither provides background knowledge necessary for answering the question. On the other hand, raters seem to prefer ECQA rationales over our constructions, indicating that the former might contain background knowledge owing to their rigorous annotation procedure [2]. 
Yet, surprisingly, raters picked our constructed rationales 9% of the time over ECQA, while being ambivalent about either rationale for 4% of the cases; moreover, they liked neither for 14% of the cases! This could indicate that some ECQA instances might not provide adequate background knowledge, and / or raters might at times choose simpler (though vacuous) rationales; future work might pursue studying such cases. 3.1.4.3 Categorizing Crowdsourced Rationales CoS-E. Although [98] criticize the quality of CoS-E rationales, CoS-E v1.11 is still widely used for commonsense reasoning [103], analysis [92, 157], and as an additional source of commonsense knowledge [162]. In order for the community to understand the deficiencies of the crowdsourced CoS-E rationales, we provide a detailed study of the same, which was missing in [98]. 45 Building on Q1 and Q2, we aim to categorize CoS-E rationales into 4 categories, to determine if these provided background knowledge and/or leaked the answer. One of the authors manually categorized the rationales in the development set of CoS-E v1.11 into four categories. To validate this categorization, three co-authors annotated a subset of 100 instances independently for the same categorization. We obtained an IAA Fleiss Kappa of 0.65 for background knowledge and 0.84 for leakage, indicating moderate / high agreement. For these 100 instances, we use the majority vote among the three annotators as the final label. Table 3.5 describes and shows the distribution of the categories, with examples from each picked at random. Rationales that do not provide additional background knowledge make up 88.45% of the entire development set of CoS-E v1.11. Using the development set as a lens, our annotation provides a qualitative and quantitative understanding of the crowdsourced CoS-E rationales. Future research should take into consideration these findings before using CoS-E rationales. ECQA. [2] build on CoS-E question-answer pairs and carefully collect detailed rationales. Table 3.1 compares CoS-E and ECQA rationales, where the former directly combines the correct answer and the question, but the latter contains additional commonsense knowledge that can help answer the question, suggesting a higher quality. Moreover, ECQA rationales are contrastive as they explain, for each option, why it is correct or incorrect. Regardless, we find that all ECQA rationales start by explaining the correct option, followed by other options. This ordering introduces a spurious correlation which likely provides a shortcut to a model for predicting the correct answer from the rationale, but for wrong reasons [39]. A random shuffle of the sentences within each ECQA rationale (last row; Table 3.1) can address this issue.6 3.1.5 Can Models Benefit from Crowdsourced Rationales? In §3.1.3, we found that crowdsourced rationales from carefully constructed corpora provide additional information to help humans better answer commonsense questions. Now, we seek to answer if these 6We use the Spacy sentencizer to split the rationale, and randomly permute sentence ordering, with seed 0. 
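Footnote 6 describes the shuffling procedure only briefly; a minimal sketch of one way to implement it with the spaCy sentencizer and seed 0 is given below (the function name and join/strip details are our own).

import random
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
rng = random.Random(0)  # seed 0, as in footnote 6

def shuffle_rationale(rationale: str) -> str:
    # Split the ECQA rationale into sentences and randomly permute their order,
    # so the sentence about the correct option no longer always comes first.
    sents = [s.text.strip() for s in nlp(rationale).sents]
    rng.shuffle(sents)
    return " ".join(sents)

print(shuffle_rationale(
    "Adults use glue sticks in their offices. "
    "They do not use them at classroom, desk drawer, at school or kitchen drawer."))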
46 c1 c2 c3 c4 c5 c6 c7 c8 Test → I I+RCoS-E I+RECQA I+RCoS-E (test subsets) %RTr w/o shuffle shuffled Rno-leak-bg Rleak-bg Rno-leak-no-bg Rleak-no-bg r1 I→O 0% 57.00 46.11 53.32 54.95 40.68 46.34 45.97 46.80 r2 IRTr CoS-E →O 5% 53.781.10 72.532.19 76.502.30 65.572.86 59.895.24 87.404.14 54.721.17 89.032.72 r3 10% 54.440.72 76.031.00 80.781.53 63.740.78 70.062.88 89.023.45 56.851.48 93.420.65 r4 20% 53.620.23 77.230.47 83.401.41 62.711.80 68.935.59 95.531.15 56.971.08 95.120.09 r5 30% 53.120.60 77.430.30 79.173.23 63.561.28 73.555.46 94.710.58 56.721.11 96.490.67 r6 100% 48.24 78.46 66.01 64.46 71.19 97.56 57.97 96.34 r7 IRTr ECQA-shf. →O 5% 54.050.95 59.270.91 86.651.10 86.351.54 51.417.10 69.101.52 53.220.49 64.532.35 r8 10% 54.051.08 61.722.11 92.550.52 93.010.37 54.805.24 72.764.03 52.531.46 69.773.16 r9 20% 53.290.32 66.500.66 95.410.48 94.701.17 64.414.99 83.741.15 55.850.69 74.531.36 r10 30% 52.850.67 65.050.78 95.850.34 95.520.51 56.505.24 81.302.30 52.910.41 75.382.15 r11 100% 38.08 67.32 97.3 96.56 55.93 93.90 39.40 91.77 Table 3.6: ComQA accuracies under various train (row) and test (column) settings. r1 is an I→O T5 baseline without access to rationales during training; the following rows use different amounts (%RTr) of CoS-E rationales (r2 − r6) and shuffled ECQA rationales (r7 − r11) for training IR→O T5 models. At inference time, each model predicts the label given no rationale (c1), or given the crowdsourced rationales for the entire test set (c2-c4), or a subset of the CoS-E test set (c5-c8), selected based on the rationale categories in Table 3.5. c4 and c3 report ECQA test set performance, when the test rationales are shuffled or not, respectively. We report accuracies averaged across 3 random seeds (stdev as subscript) for %R selection during training. rationales could also help in model learning, by providing an additional training signal to make better decisions, taking into account our findings from the detailed analysis in §3.1.3. Experimental Setup. We use finetuned T5 [117] models throughout our work following prior efforts for analyzing [157] and generating [98, 74] free-text explanations. More specifically, we finetune three model classes based on the T5-base architecture: • I→O. Predict the label directly from the question and answer options. • IR→O. Predict the label from the question, answer options and the rationale. • I→R. Predict the rationale from the question and answer options. For the IR→O model, we experiment with different variations based on the source, and the quantity of the rationales R, provided during training. Since most of our experiments deal with the first two model classes, we report accuracy of output label prediction. 47 c1 c2 c3 I I+RECQA %RTr w/o shuffle shuffled r1 I→O - 57.00 53.32 54.95 r2 IRTr ECQA →O 5% 55.45 93.94 76.66 r3 10% 55.36 96.56 73.46 r4 20% 54.55 97.21 70.02 r5 30% 53.64 97.46 66.91 r6 100% 31.44 97.79 76.33 Table 3.7: The importance of shuffling the order of sentences in ECQA rationales in training. Without shuffling, the model relies on the spurious correlation due to sentence order, as compared to r7-11/c4 in Tab. 3.6. Accuracies are averaged across 3 random seeds (s.d. as subscript) for %R selection during training, as in Tab. 3.6. We use rationales for the ComQA training instances to train two different sets of IR→O models, for CoS-E and ECQA respectively. 
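The text does not spell out how questions, options, and rationales are serialized into T5 inputs, so the template below is a hypothetical illustration of the three model classes; only the model-class definitions themselves come from the text.

from transformers import AutoTokenizer

def serialize(question, options, rationale=None):
    # Hypothetical input template; the exact format used in the experiments is not specified.
    opts = " ".join(f"({i}) {o}" for i, o in enumerate(options, 1))
    text = f"question: {question} options: {opts}"
    if rationale is not None:
        text += f" rationale: {rationale}"
    return text

# I→O :  input = serialize(q, opts)             target = correct option
# IR→O:  input = serialize(q, opts, rationale)  target = correct option
# I→R :  input = serialize(q, opts)             target = rationale text
tok = AutoTokenizer.from_pretrained("t5-base")
batch = tok(serialize("Where do adults use glue sticks?",
                      ["classroom", "desk drawer", "at school", "office", "kitchen drawer"],
                      "Adults use glue sticks in their offices."),
            return_tensors="pt")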
Under each set, we train five different models, randomly selecting different amounts (5%, 10%, 20%, 30% and the full 100%) of CoS-E and shuffled-ECQA rationales for training. During training, we use varying amounts (5%, 10%, 20%, 30% and the full 100%) of CoS-E and shuffled-ECQA rationales, to study how the quantity of rationales affects the model performance. During inference, we provide the IR→O T5 models with rationales under each of the four categories of CoS-E, as in Table 3.5, as well as all combined together. For ECQA, we report performance for inference with and without shuffled rationales. Finally, we study how rationales from one dataset transfer to the other. Crowdsourced rationales boost model performance, ruling out leakage. Comparing c1 in Table 3.6 with the columns c2-c8, we see that rationales help improve the model’s ability to make the correct prediction, even when including only 5% of the rationales during training. However, instances that leak the answer make up a large portion of CoS-E. Indeed, when provided at test time, rationales which neither leak the correct answer nor provide additional background knowledge, cause the least improvement in model performance (c7). Further, with background knowledge, but no leakage, model performance can still be improved (c5); after adding 5% of the training data, the model reaches 59.89% accuracy with Cno-leak-bg 48 %RTr I I+RQuaRTz r1 I→O - 70.88 38.27 r2 IRTr QuaRTz → O 5% 66.201.33 67.861.18 r3 10% 67.811.15 70.581.25 r4 20% 67.990.54 69.730.97 r5 30% 67.130.69 71.510.16 r6 100% 64.67 81.51 Table 3.8: QuaRTz model accuracy with and without training with knowledge statements as rationales. We report accuracies averaged across 3 random seeds (s.d. as subscript) for %R selection during training, as in Table 3.6. rationales, which yields 47.2% improvement, compared to 40.68% without rationales.7 Overall, a close inspection of the rationales is necessary to understand when they can help the model decision for the right reasons (i.e. providing background information, not simply by leaking the answer). In other words, models can benefit from those crowdsourced rationales which provide utility for human interpretability as well! Not all rationales are the same. We see benefits from increasing the amount of ECQA rationales in the training data (r7-r11/ c4), even in a transfer setting (r7-r11/ c2). However, this trend is weaker when training with CoS-E (r2 − r6). This highlights the importance of a rigorous procedure for crowdsourcing rationales [2]. Spurious correlations in rationales must be minimized. Recall from §3.1.4.3 that ECQA rationales tend to follow an ordering: sentences rationalizing the correct option precede those refuting the incorrect ones. To validate the importance of shuffling sentences in ECQA rationales, we present a baseline in Table 3.7 which considers unshuffled rationales during training, to be compared to training with shuffled rationales in Table 3.6. In the unshuffled case, training with only 5% rationales improves the accuracy on unshuffled test rationales from 53.32% to 93.94% (c2, Tab. 3.7). However, when we test the same model using 7Unlike test rationales from other categories, the trends are not monotonic for Rno-leak-bg, most likely because this is the smallest (only 4%) subset of the test set (Table 3.5). 49 shuffled rationales, the accuracy improves from 54.95% to 76.66% (c3). This shows that the model might learn a spurious correlation between the rationale and correct answer, due to ordering. 
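The %RTr subsets in Tables 3.6 and 3.7 are described only as random selections averaged over three seeds; one plausible sketch of how such subsets could be drawn (the helper name and placeholder training size are assumptions):

import random

def rationale_subset(train_ids, fraction, seed):
    # IDs of training instances that keep their rationale; the remaining instances
    # are fed to the IR→O model without a rationale.
    rng = random.Random(seed)
    k = int(round(fraction * len(train_ids)))
    return set(rng.sample(list(train_ids), k))

num_train = 1000                       # placeholder; use the actual ComQA training-set size
train_ids = list(range(num_train))
# Three seeds per setting, matching the three-seed averages reported in Table 3.6.
subsets = {seed: rationale_subset(train_ids, 0.05, seed) for seed in (0, 1, 2)}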
We recommend shuffling ECQA rationales before using them for model training. Training with non-leaky rationales is beneficial. Despite taking care to prevent spurious correlations in ECQA, there is still a chance that the models benefit from some amount of leakage of the correct answer, an uninteresting use of rationales to improve model performance. To control for this, we consider the QuaRTz dataset, introduced in §3.1.2, using knowledge statements as rationales, which are designed to contain no leakage, but provide the background information. Using a similar setup to our ComQA experiments above, we finetune T5 models for both I→O and IR→O models on QuaRTz. Results in Table 3.8 show that the non-leaky QuaRTz rationales improve a model’s ability to predict the correct answer, consistent with our findings in Table 3.6. These highlight the generalizability of our conclusions. 3.2 Event Gender Bias in Wikipedia and Gender Bias in LLM-Generated Recommendation Letters Section 3.1 shows that the quality of data in training would directly have an impact on the model. In this section, we show one piece of potential evidence suggesting that large language models carry gender bias might also result from the gender bias in the training data. 3.2.1 Event Gender Bias in Wikipedia 8 3.2.1.1 Motivation and Contribution Researchers have been using NLP tools to analyze corpora for various tasks on online platforms. For example, [104] found that female-female interactions are more intimate than male-male interactions on 8 Please refer to the published work [61]. 50 Name Wikipedia Description Loretta Young (F) Career: In 1930, when she was 17, she eloped with 26-year-old actor Grant Withers; they were married in Yuma, Arizona. The marriage was annulled the next year, just as their second movie together (ironically entitled Too Young to Marry) was released . Grant Withers (M) Personal Life: In 1930, at 26, he eloped to Yuma, Arizona with 17-year-old actress Loretta Young. The marriage ended in annulment in 1931 just as their second movie together, titled Too Young to Marry, was released . Table 3.9: The marriage events are under the Career section for the female on Wikipedia. However, the same marriage is in the Personal Life section for the male. yellow background highlights events in the passage. Twitter and Reddit. Different from social media, open collaboration communities such as Wikipedia have slowly won the trust of public [164]. Wikipedia has been trusted by many, including professionals in work tasks such as scientific journals [71] and public officials in powerful positions of authority such as court briefs [40]. Implicit biases in such knowledge sources could have a significant impact on audiences’ perception of different groups, thus propagating and even amplifying societal biases. Therefore, analyzing potential biases in Wikipedia is imperative. In particular, studying events in Wikipedia is important. An event is a specific occurrence under a certain time and location that involves participants [166]; human activities are essentially sequences of events. Therefore, the distribution and perception of events shape the understanding of society. [122] discovered implicit gender biases in film scripts using events as a lens. For example, they found that events with female agents are intended to be helpful to other people, while events with male agents are motivated by achievements. However, they focused on the intentions and reactions of events rather than events themselves. 
51 In this work, we propose to use events as a lens to study gender biases and demonstrate that events are more efficient for understanding biases in corpora than raw texts. We define gender bias as the asymmetric association of events with females and males,9 which may lead to gender stereotypes. For example, females are more associated with domestic activities than males in many cultures [80, 65]. To facilitate the study, we collect a corpus that contains demographic information, personal life description, and career description from Wikipedia.10 We first detect events in the collected corpus using a state-of-the-art event extraction model [42]. Then, we extract gender-distinct events with a higher chance to occur for one group than the other. Next, we propose a calibration technique to offset the potential confounding of gender biases in the event extraction model, enabling us to focus on the gender biases at the corpus level. Our contributions are three-fold: • We contribute a corpus of 7,854 fragments from 10,412 celebrities across 8 occupations including their demographic information and Wikipedia Career and Personal Life sections. • We propose using events as a lens to study gender biases at the corpus level, discover a mixture of personal life and professional life for females but not for males, and demonstrate the efficiency of using events in comparison to directly analyzing the raw texts. • We propose a generic framework to analyze event gender bias, including a calibration technique to offset the potential confounding of gender biases in the event extraction model. 3.2.1.2 Experimental Setup In this section, we will introduce our collected corpus and the event extraction model in our study. 9 In our analysis, we limit to binary gender classes, which, while unrepresentative of the real-world diversity, allows us to focus on more depth in analysis. 10https://github.com/PlusLabNLP/ee-wiki-bias. 52 Career Personal Life Collected Occ F M F M F M Acting 464 469 464 469 464 469 Writer 455 611 319 347 1,372 2,466 Comedian 380 655 298 510 642 1,200 Artist 193 30 60 18 701 100 Chef 81 141 72 95 176 350 Dancer 334 167 286 127 812 465 Podcaster 87 183 83 182 149 361 Musician 39 136 21 78 136 549 All 4,425 3,429 10,412 Table 3.10: Statistics showing the number of celebrities with Career section or Personal Life section, together with all celebrities we collected. Not all celebrities have Career or Personal Life sections. Dataset. Our collected corpus contains demographics information and description sections of celebrities from Wikipedia. Table 3.10 shows the statistics of the number of celebrities with Career or Personal Life sections in our corpora, together with all celebrities we collected. In this work, we only explored celebrities with Career or Personal Life sections, but there are more sections (e.g., Politics and Background and Family) in our collected corpus. We encourage interested researchers to further utilize our collected corpus and conduct studies from other perspectives. In each experiment, we select the same number of female and male celebrities from one occupation for a fair comparison. Event Extraction. There are two definitions of events: one defines an event as the trigger word (usually a verb) [112], the other defines an event as a complex structure including a trigger, arguments, time, and location [3].add citations The corpus following the former definition usually has much broader coverage, while the latter can provide richer information. 
For broader coverage, we choose a state-of-the-art event detection model that focuses on detecting event trigger words by Han2019JointEA.11 We use the model trained on the TB-Dense dataset [111] for two reasons: 1) the model performs better on the TB-Dense dataset; 2) the annotation of the TB-Dense dataset is from the news articles, and it is also where the 11We use the code at https://github.com/rujunhan/EMNLP-2019 and reproduce the model trained on the TB-Dense dataset. 53 Metric TB-D S S-F S-M Precision 89.2 93.5 95.3 93.4 Recall 92.6 89.8 87.1 89.8 F1 90.9 91.6 91.0 91.6 Table 3.11: The performance for off-the-shelf event extraction model in both common event extraction dataset TB-Dense (TB-D) and our corpus with manual annotation. S represents the sampled data from the corpus. S-F and S-M represent the sampled data for female career description and male career description separately. Occupation Events in Female Career Description Events in Male Career Description WEAT∗ WEAT Writer divorce, marriage, involve, organize, wedding argue, election, protest, rise, shoot -0.17 1.51 Acting divorce, wedding, guest, name, commit support, arrest, war, sue, trial -0.19 0.88 Comedian birth, eliminate, wedding, relocate, partner enjoy, hear, cause, buy, conceive -0.19 0.54 Podcaster land, interview, portray, married, report direct, ask, provide, continue, bring -0.24 0.53 Dancer married, marriage, depart, arrive, organize drop, team, choreograph, explore break -0.14 0.22 Artist paint, exhibit, include, return, teach start, found, feature, award, begin -0.02 0.17 Chef hire, meet, debut, eliminate, sign include, focus, explore, award, raise -0.13 -0.38 Musician run, record, death, found, contribute sign, direct, produce, premier, open -0.19 –0.41 Annotations: Life Transportation Personell Conflict Justice Transaction Contact Table 3.12: Top 5 extracted events that occur more often for females and males in Career sections across 8 occupations. We predict event types by applying EventPlus [90] on sentences that contain target events and take the majority vote of the predicted types. The event types are from the ACE dataset. We calculate WEAT scores with all tokens excluding stop words (WEAT∗ column) and only detected events (WEAT column) for Career sections. most content of Wikipedia comes from.12 We extract and lemmatize events e from the corpora and count their frequencies |e|. Then, we separately construct dictionaries E m = {e m 1 : |e m 1 |, ..., em M : |e m M|} and E f = {e f 1 : |e f 1 |, ..., e f F : |e f F |} mapping events to their frequency for male and female respectively. Event Extraction Quality. To check the model performance on our corpora, we manually annotated events in 10,508 sentences (female: 5,543, male: 4,965) from the Wikipedia corpus. Table 3.11 shows that the model performs comparably on our corpora as on the TB-Dense test set. 12According to [34], more than 20% of the references are news articles on Wikipedia. 54 argue election protest rise shoot event 0 5 10 15 20 25 30 percent gender male female (a) Male Writers divorce marriage involve wedding organize event 0 5 10 15 20 25 percent gender male female (b) Female Writers support arrest war sue trial event 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 percent gender male female (c) Actor divorce wedding guest name commit event 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 percent gender male female (d) Actress Figure 3.2: The percentile of extracted events among all detected events, sorted by their frequencies in descending order. 
The smaller the percentile is, the more frequent the event appears in the text. The extracted events are among the top 10% for the corresponding gender (e.g., extracted female events among all detected events for female writers) and within top 40% percent for the opposite gender (e.g., extracted female events among all detected events for male writers). The figure shows that we are not picking rarelyoccurred events, and the result is significant. 3.2.1.3 Detecting Gender Biases in Events Odds Ratio. After applying the event detection model, we get two dictionaries E m and E f that have events as keys and their corresponding occurrence frequencies as values. Among all events, we focus on those with distinct occurrences in males and females descriptions (e.g., work often occurs at a similar frequency for both females and males in Career sections, and we thus neglect it from our analysis). We use the Odds Ratio (OR) [143] to find the events with large frequency differences for females and males, 55 which indicates that they might potentially manifest gender biases. For an event en, we calculate its odds ratio as the odds of having it in the male event list divided by the odds of having it in the female event list: E m(en) Pi em i ̸=en i∈[1,...,M] Em(e m i ) / E f (en) Pj e f j ̸=en j∈[1,...,F] E f (e f j ) (3.1) The larger the OR is, the more likely an event will occur in male than female sections by Equation 3.1. After obtaining a list of events and their corresponding OR, we sort the events by OR in descending order. The top k events are more likely to appear for males and the last k events for females. Calibration. The difference of event frequencies might come from the model bias, as shown in other tasks (e.g., gender bias in coreference resolution model [170]). To offset the potential confounding that could be brought by the event extraction model and estimate the actual event frequency, we propose a calibration strategy by 1) generating data that contains target events; 2) testing the model performance for females and males separately in the generated data, 3) and using the model performance to estimate real event occurrence frequencies. We aim to calibrate the top 50 most skewed events in females’ and males’ Career and Personal Life descriptions after using the OR separately. First, we follow two steps to generate a synthetic dataset: 1. For each target event, we select all sentences where the model successfully detected the target event. For each sentence, we manually verify the correctness of the extracted event and discard the incorrect ones. For the rest, we use the verified sentences to create more ground truth; we call them template sentences. 2. For each template sentence, we find the celebrity’s first name and mark it as a Name Placeholder, then we replace it with 50 female names and 50 male names that are sampled from the name list 12ACE dataset: https://www.ldc.upenn.edu/collaborations/past-projects/ace 56 by checklist. If the gender changes during the name replacement (e.g., Mike to Emily), we replace the corresponding pronouns (e.g., he to she) and gender attributes [170] (e.g., Mr to Miss) in the template sentences. As a result, we get 100 data points for each template sentence with automatic annotations. If there is no first name in the sentence, we replace the pronouns and gender attributes. After getting the synthetic data, we run the event extraction model again. 
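A compact sketch of Equation 3.1, assuming the event-to-frequency dictionaries E^m and E^f are plain Python counters; restricting the ranking to events observed for both genders is our simplification to avoid division by zero, not a detail stated in the text.

from collections import Counter

def odds_ratio(event, E_m, E_f):
    # Eq. 3.1: odds of the event in the male event list divided by
    # the odds of the event in the female event list.
    m, f = E_m.get(event, 0), E_f.get(event, 0)
    m_rest = sum(E_m.values()) - m
    f_rest = sum(E_f.values()) - f
    return (m / m_rest) / (f / f_rest)

def top_k_skewed(E_m, E_f, k=5):
    events = set(E_m) & set(E_f)
    ranked = sorted(events, key=lambda e: odds_ratio(e, E_m, E_f), reverse=True)
    return ranked[:k], ranked[-k:]          # male-skewed events, female-skewed events

# Toy frequency dictionaries of lemmatized event triggers.
E_m = Counter({"argue": 30, "election": 25, "divorce": 5})
E_f = Counter({"divorce": 40, "marriage": 35, "argue": 6})
print(top_k_skewed(E_m, E_f, k=1))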
We use the detection recall among the generated instances to calibrate the frequency |e| for each target event and estimate the actual frequency |e| ∗ , following: |e| ∗ = |e| T P(e)/(T P(e) + F P(e)) (3.2) Then, we replace |e| with |e| ∗ in Equation 3.1, and get k female and k male events by sorting OR as before. Note that we observe the model performances are mostly unbiased, and we have only calibrated events that have different performances for females and males over a threshold (i.e., 0.05). WEAT score. We further check if the extracted events are associated with gender attributes (e.g., she and her for females, and he and him for males) in popular neural word embeddings like Glove [105]. We quantify this with the Word Embedding Association Test (WEAT) [13], a popular method for measuring biases in text. Intuitively, WEAT takes a list of tokens that represent a concept (in our case, extracted events) and verifies whether these tokens have a shorter distance towards female attributes or male attributes. A positive value of WEAT score indicates that female events are closer to female attributes, and male events are closer to male attributes in the word embedding, while a negative value indicates that female events are closer to male attributes and vice versa. To show the effectiveness of using events as a lens for gender bias analysis, we compute WEAT scores on the raw texts and detected events separately. For the former, we take all tokens excluding stop words.13 12We did not show the result for the artists and musicians due to the small data size. 13We use spaCy (https://spacy.io/) to tokenize the corpus and remove stop words. 57 Together with gender attributes from WEAT, we calculate and show the WEAT scores under two settings as “WEAT∗ ” for the raw texts and “WEAT” for the detected events. Occupation Events in Female Personal Life Description Events in Male Personal Life Description WEAT∗ WEAT Writer bury, birth, attend, war, grow know, report, come, charge, publish -0.05 0.31 Acting pregnant, practice, wedding, record, convert accuse, trip, fly, assault, endorse -0.14 0.54 Comedian feel, birth, fall, open, decide visit, create, spend, propose, lawsuit -0.07 0.07 Podcaster date, describe, tell, life, come play, write, born, release, claim -0.13 0.57 Dancer marry, describe, diagnose, expect, speak hold, involve, award, run, serve -0.03 0.41 Chef death, serve, announce, describe, born birth, lose, divorce, speak, meet -0.02 -0.80 Annotations: Life Transportation Personell Conflict Justice Transaction Contact Table 3.13: Top 5 events in Personal Life section across 6 occupations. There are more Life events (e.g., “birth” and “marry”) in females’ personal life descriptions than males’ for most occupations. While for males, although we see more life-related events than in the Career section, there are events like “awards” even in the Personal Life section. The findings further show our work is imperative and addresses the importance of not intermingling the professional career with personal life regardless of gender during the future editing on Wikipedia. 3.2.1.4 Results The Effectiveness of our Analysis Framework. Table 3.12 and Table 3.13 show the associations of both raw texts and the extracted events in Career and Personal Life sections for females and males across occupations after the calibration. The values in WEAT∗ columns in both tables indicate that there was only a weak association of words in raw texts with gender. 
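The calibration of Equation 3.2 and the WEAT comparison can be sketched as follows. Note that the prose refers to detection recall while Equation 3.2 is written with TP/(TP+FP), so the sketch simply takes a per-event detection rate; the effect-size form of WEAT shown here is one common variant and not necessarily the exact statistic behind Tables 3.12 and 3.13.

import numpy as np

def calibrate(freq, detection_rate):
    # Eq. 3.2: estimated true frequency |e|* from the observed frequency |e| and the
    # extractor's detection rate measured on the synthetic, name-swapped sentences.
    return freq / detection_rate

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weat(male_events, female_events, male_attrs, female_attrs, vec):
    # vec: word -> embedding (e.g., GloVe). Positive values mean male events sit closer
    # to male attributes and female events closer to female attributes, as in the text.
    def assoc(w):
        return (np.mean([_cos(vec[w], vec[a]) for a in male_attrs])
                - np.mean([_cos(vec[w], vec[b]) for b in female_attrs]))
    s_m = [assoc(w) for w in male_events]
    s_f = [assoc(w) for w in female_events]
    return (np.mean(s_m) - np.mean(s_f)) / np.std(s_m + s_f)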
In contrast, the extracted events are associated with gender for most occupations. It shows the effectiveness of the event extraction model and our analysis method. The Significance of the Analysis Result. There is a possibility that our analysis, although it picks out distinct events for different genders, identifies the events that are infrequent for all genders and that the frequent events have similar distributions across genders. To verify, we sort all detected events from our corpus by frequencies in descending order. Then, we calculate the percentile of extracted events in the sorted list. The smaller the percentile is, the more frequent the event appears in the text. Figure 3.2 58 shows that we are not picking the events that rarely occur, which shows the significance of our result. For example, Figure 3.2a and Figure 3.2b show the percentile of frequencies for selected male and female events among all events frequencies in the descending order for male and female writers, respectively. We can see that for the corresponding gender, event frequencies are among the top 10%. These events occur less frequently for the opposite gender but still among the top 40%. Findings and Discussions. We find that there are more Life events for females than males in both Career and Personal Life sections. On the other hand, for males, there are events like “awards” even in their Personal Life section. The mixture of personal life with females’ professional career events and career achievements with males’ personal life events carries implicit gender bias and reinforces the gender stereotype. It potentially leads to career, marital, and parental status discrimination towards genders and jeopardizes gender equality in society. We recommend: 1) Wikipedia editors to restructure pages to ensure that personal life-related events (e.g., marriage and divorce) are written in the Personal Life section, and professional events (e.g., award) are written in Career sections regardless of gender; 2) future contributors should also be cautious and not intermingle Personal Life and Career when creating the Wikipedia pages from the start. 3.2.1.5 Conclusion We conduct the first event-centric gender bias analysis at the corpus level and compose a corpus by scraping Wikipedia to facilitate the study. Our analysis discovers that the collected corpus has event gender biases. For example, personal life related events (e.g., marriage) are more likely to appear for females than males even in Career sections. We hope our work brings awareness of potential gender biases in knowledge sources such as Wikipedia, and urges Wikipedia editors and contributors to be cautious when contributing to the pages. 59 3.2.2 Gender Bias in LLM-Generated Reference Letters Wikipedia is known to be one of the sources that researchers train LLMs. Given the known conclusion that Wikipedia contains gender bias, we ask the question of whether such gender bias would propagate to the LLMs that are trained on it and further influence downstream tasks. More specifically, we choose LLM-generated reference letters as the downstream task for our study. 3.2.2.1 Why Reference Letters? LLMs have emerged as helpful tools to facilitate the generation of coherent long texts, enabling various use cases of document generation [131, 100, 138, 41]. Recently, there has been a growing trend to use LLMs in the creation of professional documents, including recommendation letters. 
The use of ChatGPT for assisting reference letter writing has been a focal point of discussion on social media platforms14 and reports by major media outlets15. However, the widespread use of automated writing techniques without careful scrutiny can entail considerable risks. Such biases might also infiltrate automated reference letter generation and cause substantial societal harm, as research in the social sciences [91, 69] has shown how biases in professional documents lead to diminished career opportunities for gender minority groups.
3.2.2.2 Bio-based Reference Letter Generation
Data Preprocessing. We utilize personal biographies as context information for the CBG task. Specifically, we further preprocess and use WikiBias, the personal biography dataset with scraped demographic and biographic information from Wikipedia that we collected in Section 3.2.1. Our data augmentation pipeline aims at producing an anonymized and gender-balanced biography dataset as context information for reference letter generation, so as to prevent pre-existing biases.
14See, for example, the discussion on Reddit https://shorturl.at/eqsV6
15For example, see the article published in the Atlantic https://shorturl.at/fINW3.
Prompt Design. We use prompting to obtain LLM-generated reference letters by providing the model with context information in the form of personal biographies in the input. Specifically, we verbalize biographies in the pre-processed WikiBias dataset with the designed prompt templates and query the LLMs with the combined information. After filtering out unsuccessful generations, we obtain 6,028 generations for ChatGPT and 4,228 for Alpaca.
Evaluation: Biases in Lexical Content. Given our aim to investigate biases in nouns and adjectives as lexical content, we first extract words of these two lexical categories from the generated documents. To do this, we use the spaCy Python library [48] to match and extract all nouns and adjectives in the generated documents for males and females. After collecting the words, we create a noun dictionary and an adjective dictionary for each gender, to which we then apply the odds ratio analysis.
Evaluation: Biases in Language Style. We implement three corresponding metrics for evaluation.
Biases in Language Formality. To evaluate biases in language formality, we first classify the formality of each sentence in the generated letters and calculate the percentage of formal sentences in each generated document. To do so, we apply an off-the-shelf formality classifier from the Transformers library that is fine-tuned on Grammarly's Yahoo Answers Formality Corpus (GYAFC) [121]. We then conduct statistical t-tests on the formality percentages of male and female documents to report significance levels.
Biases in Language Positivity. Similarly, to evaluate biases in language positivity, we calculate and conduct t-tests on the percentage of positive sentences in each generated document for males and females. To do so, we apply an off-the-shelf sentiment analysis classifier from the Transformers library that was fine-tuned on the SST-2 dataset [136].
Language Agency Classifier. Along similar lines, to evaluate biases in language agency, we conduct t-tests on the percentage of agentic sentences in each generated document for males and females.
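As a concrete sketch of the lexical-content extraction and the percentage-plus-t-test style evaluations above (the agency variant is analogous once the classifier described next is available): the spaCy model, the particular SST-2 checkpoint, and the placeholder letters below are illustrative assumptions, not necessarily the exact resources used in the original experiments.

import spacy
from collections import Counter
from scipy import stats
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")            # requires: python -m spacy download en_core_web_sm

def pos_dicts(documents):
    # Noun and adjective dictionaries for one gender, later fed to the odds-ratio analysis.
    nouns, adjs = Counter(), Counter()
    for doc in nlp.pipe(documents):
        for tok in doc:
            if tok.pos_ == "NOUN":
                nouns[tok.lemma_.lower()] += 1
            elif tok.pos_ == "ADJ":
                adjs[tok.lemma_.lower()] += 1
    return nouns, adjs

# Percentage of positive (analogously: formal, agentic) sentences per letter, then a t-test.
sentiment = pipeline("text-classification",
                     model="distilbert-base-uncased-finetuned-sst-2-english")  # a standard SST-2 checkpoint

def positive_ratio(letter):
    sents = [s.text for s in nlp(letter).sents]
    preds = sentiment(sents, truncation=True)
    return sum(p["label"] == "POSITIVE" for p in preds) / max(len(sents), 1)

male_letters = ["He is a respectful and humble colleague.", "He is a generous and proud mentor."]          # placeholders
female_letters = ["She is a warm and amazing team player.", "She is a stunning and emotional performer."]  # placeholders
print(stats.ttest_ind([positive_ratio(d) for d in male_letters],
                      [positive_ratio(d) for d in female_letters]))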
Implementation-wise, since language agency is a novel concept in NLP research, no previous study has explored means to classify agentic and communal language styles in texts. We use ChatGPT to synthesize a language agency classification corpus and use it to fine-tune a transformer-based language agency classification model. Model Aspect Male Female ChatGPT Nouns man, father, ages, actor, thinking, colleague, flair, expert, adaptation, integrity actress, mother, perform, beauty, trailblazer, force, woman, adaptability, delight, icon Adj respectful, broad, humble, past, generous, charming, proud, reputable, authentic, kind warm, emotional, indelible, unnoticed, weekly, stunning, multi, environmental, contemporary, amazing Alpaca Nouns actor, listeners, fellowship, man, entertainer, needs, collection, thinker, knack, master actress, grace, consummate, chops, none, beauty, game, consideration, future, up Adj classic, motivated, reliable, non, punctual, biggest, political, orange, prolific, dependable impeccable, beautiful, inspiring, illustrious, organizational, prepared, responsible, highest, ready, remarkable Table 3.14: Qualitative evaluation results on ChatGPT for biases in Lexical Content. Red: agentic words, Orange: professional words, Brown: standout words, Purple: feminine words, Blue: communal words, Pink: personal words, Gray: agentic words. WEAT(MF) and WEAT(CF) indicate WEAT scores with Male/Female Popular Names and Career/Family Words, respectively. Result. Table 3.14 shows results for biases in lexical content on ChatGPT and Alpaca. Specifically, we show the top 10 salient adjectives and nouns for each gender. We observe that both ChatGPT and Alpaca tend to use gender-stereotypical words in the generated letter (e.g. “respectful” for males and “warm” for females). Such findings echo with our conclusion in Section 3.2.1 that personal life related events such as marriage/pregnant/birth appeared in the Career sections for females, suggesting that the gender bias in LLM-generated reference letters might due to the existing gender bias on Wikipedia. 62 Chapter 4 Efficient High-Quality Data Acquisition Leads to Better Models There are many ways to acquire high-quality data. During my Ph.D. study, I have explored 1) collaborating with human annotators as in ExPUNation [60] in Section 4.1, 2) scraping high-quality online forums as in LIMA [171] and 3) use feedback from external AI models to generate high-quality synthetic data for training, as in DreamSync [57] in Section 4.2. 4.1 ExPUNations: Augmenting Puns with Keywords and Explanations 1 4.1.1 Motivation and Contribution Humor serves multiple purposes and provides numerous benefits, such as relieving anxiety, avoiding painful feelings and facilitating learning [11]. As a specific example of humor, the creative uses of puns, wordplay and ambiguity are important ways to come up with jokes [21]. Pun understanding and generation are particularly challenging tasks because they require extensive commonsense and world knowledge to compose and understand, even for humans. Despite growing interest in the area, there are limited amounts of data available in the domain of humor understanding and generation. Existing humor datasets are usually only annotated with binary labels indicating whether each sentence is a joke, pun, or punchline [43, 154, 15, 95]. This is insufficient to benchmark models’ ability to 1 Please refer to the published work [60]. 63 Text When artists dream in color it’s a pigment of their imagination. 
KWD artists , dream , color , pigment , imagination . NLEx Pigments are non-soluble materials often used in painting, and pigment sounds like figment, which is something that is not real but someone believes it is. Text The man found something to catch fish, which was a net gain. KWD catch fish , net gain . NLEx This is a play on words. A “net gain” means an increase in revenue but here “net” refers to how a net is used to catch fish. Table 4.1: Two examples of annotated Keywords (KWD) and Natural Language Explanations (NLEx) for puns in our dataset. The highlighted texts are annotated keywords that contribute to making the text funny. Text Be True to your teeth, or they will be false to you. Drinking too much of a certain potent potable may require a leave of absinthe. Understandable [1, 1, 1, 1, 0] [1, 1, 1, 1, 1] Offensive/Inappropriate [0, 1, 0, 0, 0] [0, 0, 0, 0, 0] Is a joke? [1, 0, 1, 0, 0] [1, 1, 1, 1, 1] Funniness (1-5) [2, 0, 1, 0, 0] [3, 4, 2, 1, 2] Natural Language Explanation (NLEx) NLEx1: Talking about being true as in being real or they will be fake/false teeth. NLEx2: False teeth are something people who lose their teeth may have, and being true to your teeth may be a way of saying take care of them otherwise you’ll lose them. NLEx1: It’s a pun that replaces the word absence with absinthe, which is notoriously strong alcohol. NLEx2: This is a play on words. Absinthe here represents the liquor by the same name but is meant to replace the similar-sounding “absence”. Too much absinthe will make you ill. Joke keywords (KWD) KWD1: [“true”, “teeth”, “false”] KWD2: [“be true”, “teeth”, “false to you”] KWD1: [“drinking”, “leave of absinthe”] KWD2: [“drinking too much”, “leave of absinthe”] Table 4.2: Two examples with annotation fields that we collect. We use underline to mark the commonsense knowledge that people need in order to understand the joke. understand and generate novel humorous text, since hardly anything meaningful can be learned from such a sparse supervision signal and coarse-grained annotation. To facilitate research on humor understanding and generation, we present the ExPUNations (ExPUN) dataset, in which we augment an existing dataset of puns from SemEval 2017 Task 7 [94] with detailed crowdsourced annotations of fine-grained funniness ratings on a Likert scale of one to five, along with keywords denoting the most distinctive words that make the text funny and natural language explanations describing why the text is funny (Table 4.1). In addition, we collect annotations indicating whether a person understands the sentence, thinks it is a pun, and finds the joke offensive or inappropriate. Since these tasks 64 are all highly subjective, we collect multiple annotations per sample, and present a detailed agreement analysis. We believe our annotations can be used in many other applications beyond pun understanding and generation, such as toxicity detection. The contributions of our work are threefold: First, we contribute extensive high-quality annotations for an existing humor dataset along multiple dimensions.2 Secondly, based on the annotations, we propose two tasks, explanation generation for pun classification and keyword-conditioned pun generation, to advance research on humor understanding and generation. Thirdly, we benchmark state-of-the-art NLP models on explanation generation for pun classification and keyword-conditioned pun generation. 
Our experiments demonstrate the benefits of utilizing natural language keywords and explanations for humor understanding and generation while highlighting several potential areas of improvement for the existing models. 4.1.2 ExPUN Dataset In this section, we describe our data annotation procedure, including details of the annotation fields and our assessment of the annotation quality. 4.1.2.1 Data Preparation The original SemEval 2017 Task 7 dataset [94] 3 contains puns that are either homographic (exploiting polysemy) or heterographic (exploiting phonological similarity to another word). The dataset also contains examples of non-pun text. We sample 1,999 text samples from SemEval 2017 Task 7 as the basis for our humor annotation. 4 2Resources are available at: https://github.com/amazon-research/expunations 3 https://alt.qcri.org/semeval2017/task7/. The data is released under CC BY-NC 4.0 license (https:// creativecommons.org/licenses/by-nc/4.0/legalcode). 4We sample 834 heterographic puns, 1,074 homographic puns and 91 non-puns. 65 4.1.2.2 Dataset Annotation The annotated fields (AF) come in the order of: AF1 [understandability]: whether the annotator understands the text or not, regardless of whether they perceive it as funny. AF2 [offensiveness]: whether the annotator finds the text offensive or inappropriate. AF3 [joke]: whether the annotator thinks the text is intended to be a joke. AF4 [funniness]: rate the funniness on a Likert scale of 1-5, where 1 means very not funny and 5 means very funny. AF5 [explanation]: explain in concise natural language about why this joke is funny. More specifically, if external or commonsense knowledge is required to understand the joke and/or its humor, the annotator should include the relevant knowledge in the explanation. If the joke is a pun or play on words, they must provide an explanation of how the play on words works. AF6 [joke keywords]: pick out (as few as possible) keyword phrases from the joke that are related to the punchline/the reason the joke is funny. We emphasize that phrases should be sparse and mainly limited to content words, can be multiple words long, and the keywords should be copied verbatim from the joke. If an annotator rates the instance as not understandable, they will skip the rest of the annotation for that instance (AF2-AF6). In addition, if an annotator rates an example as not a joke, they can skip the rest of the annotation (AF4-AF6). Table 4.2 shows two examples in our dataset. The first example has two annotators who think the text is a joke, and therefore it has two explanations. In the second instance, all annotators unanimously agree it is a joke. Here, we sample two explanations from the original five. For both instances, we use underline to highlight the external commonsense knowledge in the explanation. 66 total AF1 AF2 AF3 # samples 1,999 1,795 65 1,449 AF4: Avg. funniness 1.68 AF5: Explanations total # explanations 6,650 avg. # explanations/sample 3.33 avg. # tokens/expl. 31.67 avg. # sentences/expl. 2.01 AF6: Keyword phrases avg. # tokens/keyword phrase 1.33 avg. # keyword phrases/sample 2.09 Table 4.3: Overall stats for annotation fields in ExPUN. If the joke is a play on words, the explanation also shows how the play on words works (e.g., the second joke). We crowdsourced 5 annotations per sample using a professional team of 10 dedicated full-time annotators within our organization. Before starting the task, we held a kick-off meeting with the team to explain the annotation guidelines in detail. 
We then conducted 3 pilot rounds for calibration and iteratively met with annotators, including more details and examples to address annotator questions. Finally, we conducted 7 rounds of annotation, each with between 100-300 puns per round grouped into minibatches of 50 examples. Each sample in a minibatch was annotated by consistent subteams of 5 annotators. After receiving a completed batch of annotations, we manually examined their quality and provided feedback on any quality issues, redoing batches as necessary. 4.1.2.3 Dataset Statistics and Quality Control We report overall dataset statistics in Table 4.3. For AF1 − AF3, we count the number of samples labeled positive by majority vote. For AF4, we compute the average of all funniness scores, excluding blank annotations, and find that while annotators recognized most samples as jokes, they did not find them to be particularly funny. For AF5 and AF6, we compute lexical statistics of our explanations and keyword annotations and provide deeper analysis of these key annotation fields in Section 4.1.2.4. 67 Annotation Field κ ρ BLEU MET. AF1: Understand (0/1) 0.40 0.16 - - AF2: Offensive (0/1) 0.16 0.34 - - AF3: Joke (0/1) 0.58 0.32 - - AF4: Funny (1-5) 0.41 0.30 - - AF5: Explain (Text) - - 0.18 0.30 AF6: Keywords (Text) - - 0.58 0.74 Table 4.4: Agreement stats for annotated fields in the ExPUN dataset. We report averaged Cohen’s κ and Spearman’s ρ for numeric ratings (AF1 − AF4), and averaged BLEU-4 and METEOR for text fields (AF5 − AF6). We report inter-annotator agreement for all annotation fields in Table 4.4. 5 For fields AF1-AF4, we compute agreement using (1) the average of Cohen’s kappa scores of each annotator against the majority vote, and (2) the average Spearman correlation between each pair of annotators. We find that annotators show moderate agreement when deciding if the given text is a joke (AF3), but lower agreement on the task of understanding the text (AF1) as well as the much more subjective task of rating how funny a joke is (AF4). We also find weak average Spearman correlation between each pair of annotations for the subjective categories of offensiveness (AF2), whether the text is a joke (AF3) and joke funniness (AF4). For the free text fields in AF5 and AF6, we compute averaged BLEU-4 papineni-etal-2002-bleu and METEOR banerjee-lavie-2005-meteor scores in a pairwise fashion. We treat each annotator’s explanation (for AF5) or list of keyword phrases joined into a string (for AF6) as candidate text, with the remaining annotators’ annotations as a set of references. We find high similarity between joke keyword annotations, suggesting that annotators identify similar spans of keyword phrases, and a lower degree of similarity between pun explanations. 4.1.2.4 Dataset Analysis Explanations. As seen in Figures 4.1a and 4.1b, on average, samples are annotated with multiple explanations, and the explanations are lengthy, spanning multiple sentences, and lexically diverse (14,748 token 5When computing agreement, we exclude the first 100 annotated samples, as these were used as a calibrating pilot. 68 vocabulary size, with 210,580 tokens overall). The frequent use of usually and often indicate the explanation of commonsense knowledge, e.g., thunder and lightning are usually present in a weather storm or “pain” means physical discomfort often felt by a hospital patient. 
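The agreement statistics in Table 4.4 can be reproduced along the following lines; the sacrebleu-based BLEU (rescaled to 0-1), the omission of METEOR (nltk's meteor_score would be the analogue), and the assumption that numeric fields are used for the Spearman correlations are our implementation guesses.

import numpy as np
from collections import Counter
from itertools import combinations
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score
import sacrebleu

def kappa_vs_majority(ratings):
    # ratings: (n_samples, n_annotators) categorical labels for AF1-AF4.
    ratings = np.asarray(ratings)
    majority = [Counter(row).most_common(1)[0][0] for row in ratings]
    return np.mean([cohen_kappa_score(ratings[:, j], majority)
                    for j in range(ratings.shape[1])])

def pairwise_spearman(ratings):
    # Average Spearman correlation over all annotator pairs (numeric ratings only).
    ratings = np.asarray(ratings, dtype=float)
    pairs = combinations(range(ratings.shape[1]), 2)
    return np.mean([spearmanr(ratings[:, i], ratings[:, j]).correlation for i, j in pairs])

def pairwise_bleu(texts_per_sample):
    # Each annotator's explanation (or joined keyword list) is scored against the
    # remaining annotators' texts as references, then averaged.
    scores = []
    for texts in texts_per_sample:
        for i, cand in enumerate(texts):
            refs = [t for j, t in enumerate(texts) if j != i]
            if refs:
                scores.append(sacrebleu.sentence_bleu(cand, refs).score)
    return np.mean(scores) / 100.0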
The most frequent words, means and word, indicate that annotators frequently provide word sense information as part of their explanations, while sounds frequently appears in explanations of heterographic puns. Each of these most frequent words comprise less than 2.8% of all tokens in the explanations, illustrating the rich diversity of our corpus. Keywords. As seen in Figures 4.1c and 4.1d, on average, keyword phrases in ExPUN, which are derived from the original puns, are short and sparse (5,497 token vocabulary size, with 27,820 tokens overall). This follows from our guidelines to annotate keywords concisely, focusing mainly on content words that are essential to understanding the joke. Table 4.5 shows two examples of pun keyword annotations in our dataset that showcase different annotation styles among annotators. For instance, one annotator may tend to select wordy keyword phrases that introduce unnecessary tokens, while another may omit salient keywords that other annotators mention. Aggregating these annotations among annotators to construct a single ground truth set of keyword phrases is therefore challenging because of differing annotation styles. The problem of merging keywords is further complicated because the keywords from different annotators are often not aligned well, as different annotators may annotate varying numbers of keyword phrases and different spans. Taking these considerations into account, we propose a keyword aggregation algorithm to address these issues and construct a single set of aggregated keywords per sample. Keywords Aggregation. We propose the keyword aggregation algorithm in Algorithm 1 to merge keywords annotation among different workers. Algorithm 1 describes our keyword aggregation method. The algorithm aims to generate a comprehensive list of concise keywords for each sample. First, we compute a reliability score for each annotation, defined as the average of (# keyword phrases−# average tokens in each keyword phrase). The higher the 69 (a) Tokens/explanation (b) Sentences/explanation (c) Tokens/keyword phrase (d) Keyword phrases/sample Figure 4.1: Distributions of (a) number of tokens and (b) number of sentences in explanations (AF5), (c) tokens in keyword phrases (AF6), and (d) keyword phrases per sample. Horizontal lines are used to show the min, mean, and max values for each distribution. Algorithm 1 Keyword Aggregation Algorithm Input: For each instance Xi , i ∈ {1, ..., N}, annotations from every worker wj , j ∈ {1, ..., 5} denoted as Xij . Output: keywords for Xi 1: for j ∈ {1, ..., 5} do 2: // calculate the reliability score Sj = 1 N PN i=0 (#keywords−#average tokens in each keyword) 3: end for 4: sort all workers with S and get preferred worker list L 5: // set worker with the highest S as anchor worker wa 6: aggregated_keywords = [] 7: for Kz ∈ Xia do 8: filtered_keywords Kfilter = [] 9: for j ∈ {1, ..., 5} do 10: for Kp ∈ Xij do 11: calculate F(Kz, Kp) 12: end for 13: choose the keyword KP in Xij with highest F 14: if F(Kz, KP ) > 60 then 15: append keyword KP to Kfilter 16: end if 17: end for 18: AV Ga = 1 len(Ff ilter) PF(Kz, K)K ∈ Kfilter 19: set the worker with the second highest S as new anchor worker wb. 
Repeat L6-L18 and get AV Gb 20: if AV Ga ≥ AV Gb then 21: append Xia to aggregated_keywords 22: else 23: append Xib to aggregated_keywords 24: end if 25: /* if only one worker has keyword annotation, append this worker’s annotation to aggregated_keywords */ 26: Remove duplication from aggregated_keywords 27: end for score, the more comprehensive and concise the keywords from an annotator should be. We choose the annotator with the highest score to be the anchor. We note, however, that keyword annotations are not always error-free; e.g., in the first example of Table 4.5, w4 has an incorrect word (fancy chairs instead 70 Royal chairs are rarely throne out. She didn’t marry the gardener. Too rough around the hedges. w1 [Royal chairs, throne out] [didn’t marry the gardener, too rough around the hedges] w2 [Royal chairs, throne out] [didn’t marry the gardener, rough around the hedges] w3 [Royal chairs, rarely throne out] [didn’t marry the gardener, rough around the hedges] w4 [fancy chairs, throne] [gardener, rough, hedges] w5 [Royal chairs, throne] [gardener, rough around the hedges] wA [royal chairs, throne out] [gardener, rough, hedges] Table 4.5: Keyword annotations from different workers. wA shows aggregated keywords from our algorithm. of royal chairs). Therefore, for each keyword phrase, we compute the fuzzy matching score between the anchor’s annotation with the rest of annotators’ annotations. For each annotator, we keep the keyword phrase that has the highest fuzzy matching score with the anchor annotator’s, with a minimum threshold score of 60. 6 This process produces a filtered keyword list where each of the remaining keyword phrases look similar to the anchor’s. Then, we compute the average fuzzy matching score between the anchor’s keyword phrase and each element in the filtered keyword list. We then choose the annotator with the second-highest reliability score to be the anchor, and repeat the above process. Finally, by choosing the resulting keyword phrases that attain the maximum average fuzzy matching score between the first and second anchors, we get the final aggregated keywords for this instance. 4.1.3 Experiments With the collected annotations, we propose two new tasks, pun explanation and keyword conditioned pun generation, to showcase novel tasks that our dataset uniquely enables and push the frontiers of NLU and NLG for humor. Note that the rich annotations in ExPUN can also enable many other interesting tasks 6This is empirically determined. 71 (a) Gold explanations during test. (b) Generated explanations during test. (c) ELV model. Figure 4.2: The impact of using human-written (4.2a) and model-generated explanations (4.2b and 4.2c) vs. no explanations (constant dotted lines) on pun classification accuracy. All reported numbers are computed with three-seed average. For each data point, we train a model on the full dataset, but only provide explanations for a given percentage, as shown on the x-axis. such us pun keywords extraction, fine-grained funniness prediction, and others. However, we prioritize NLG tasks as they are relatively under-explored compared to NLU tasks. In this section, we benchmark current state-of-the-art models’ performance on the proposed tasks. 4.1.3.1 Pun Explanation The task of pun explanation takes a pun sentence as input and outputs a natural language explanation of why the pun is funny. This requires extensive understanding of background and commonsense knowledge. 
We hypothesize that existing NLP models would struggle to generate high-quality explanations for puns. On the other hand, high-quality explanations can improve humor understanding, and thus help tasks such as humor classification. Formally, given text T, our target is to generate an explanation ET of why T is funny. Additionally, we use the explanations to support the task of pun classification, where, given T (and optionally an explanation ET ), we output whether T is a joke. Data Preparation. For each data sample, we use the longest human-written explanation from ExPUN (AF5), substituting in the pun text if no explanations exist. 7 For pun classification, we assign output 7Only 168 samples have no annotated explanations. 72 labels using the majority vote of AF3 (is a joke). For both tasks, we split our dataset into 1,699/100/200 for train/dev/test. Dev and test contain an equal distribution jokes to non-jokes, while training contains 1,299 jokes and 400 non-jokes. Evaluation Metrics. We do not report lexical overlap metrics as our primary evaluation metric for generated explanations because these are not suited for measuring plausibility [14, 67, 25] or faithfulness of explanations [55]. Rather, we follow prior work and use the “simulatability score” metric from [157] to measure explanation quality from the lens of usability of the explanation. It reflects the utility of explanations by measuring the improvement in task performance when explanations are provided as additional input vs. when they are not: acc(IE → O) − acc(I → O), where I denotes the input text, E is the explanation and O is the classification of whether I is a joke. We evaluate how useful explanations can be by measuring the performance increase of acc(IE → O) as we increase the ratio of samples with explanations in the training data, and report acc(I → O) as a constant baseline that uses no explanations. Models. We use the following model variations: No explanations. As a baseline, we finetune BERT-base [28], RoBERTa-base [88] and DeBERTa-base [44] to classify whether the given text is a joke without any explanations in the input. Gold explanations. To find the upper bound of how useful explanations can be, we augment the input to the above baseline models with gold human-annotated explanations in both training and testing. The majority of non-punny examples (identified as unfunny by majority vote and thus labeled as unfunny) contain at least one explanation from an annotator who marked it as funny. In these cases, we use any provided explanations as E, both in training and in testing with gold explanations. Otherwise, to construct training examples that have no annotated explanations, or where explanations are held out, we try two variants: (1) representing the missing explanation as an empty string (“w/ gold expl.”), or (2) randomly sampling a negative explanation from another annotated example to use as input (“w/ gold + sampled neg.”). 73 Generated explanations. Following previous work on explanation generation [157], we first finetune a T5 [117] model to generate pun explanations given pun sentences as input. For text that contains no annotated explanations, we use the pun sentence itself as the output explanation. We then use gold human-annotated explanations to train and T5-generated explanations to test the explanation-augmented classification models. ELV [172]. ELV is a probabilistic framework for text classification where natural language Explanations are treated as Latent Variables. 
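The simulatability comparison reduces to two accuracy measurements; the sketch below assumes a thin predict wrapper around the finetuned classifier and an ad-hoc way of appending the explanation to the input, neither of which is specified in the text.

def accuracy(model, examples, with_explanation):
    correct = 0
    for ex in examples:
        text = ex["text"]
        if with_explanation and ex.get("explanation"):
            text = text + " explanation: " + ex["explanation"]   # input concatenation format is an assumption
        correct += int(model.predict(text) == ex["is_joke"])
    return correct / len(examples)

def simulatability(ie_model, i_model, examples):
    # Simulatability score following [157]: acc(IE -> O) - acc(I -> O),
    # i.e. how much providing the explanation helps the joke classifier.
    return (accuracy(ie_model, examples, with_explanation=True)
            - accuracy(i_model, examples, with_explanation=False))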
Two modules, an explanation generation module and an explanationaugmented prediction module are jointly trained using a variational EM framework. As another baseline, we train an ELV model for pun classification using the ExPUN dataset. Results. We show our results on the pun classification task in Figure 4.2. Baseline performance of the no explanations models are shown using constant dotted lines. Figure 4.2a shows the upper bound of performance improvement when models are provided with gold explanations, indicating that human-written explanations are useful for this task, and that including more gold explanations in training data generally helps. In particular, adding randomly-sampled negative explanations (“w/ gold + sampled neg.”) further improves the classification accuracy, showing the utility of our collected explanations in improving model performance. However, Figure 4.2b shows that using generated explanations at test time does not help to improve classification accuracy. Using the more carefully-designed ELV framework to jointly train the generation and classification modules shows improvement in classification accuracy (Figure 4.2c); however, qualitative analysis of the ELV explanations showed that many generated outputs are not fluent natural language, suggesting that performance improvements may stem more from modeling improvements as opposed to explanations. Given the huge improvements we see when incorporating gold explanations during test, we note explanations are clearly highly valuable if the quality of generated explanations can be improved. 74 Input Pun Generated Explanation My name is Cary. I’m a porter. The joke is a pun on the word “cary”. A porter is someone who transports goods. Fishers often wet their appetite. This is a play on words. The word “wet” means to wet your appetite, which is a characteristic of fish. A gossip is someone with a great sense of rumor. This is a play on words. The word “rumor” sounds like “rumor”. A gossip is someone who has a great sense of rumor. Oil executives are always using crude language. The joke is a pun on the word “crude”. Crude language is used to describe crude oil, which is a type of petroleum product. Please mix me a martini, said Tom, dryly. This is a play on words. The word “dryly” means dryly, but “dryly” sounds like “dryly”. Table 4.6: Pun explanations generated by the T5 model. We use underline to indicate the pun word in the input. Table 4.6 shows examples of T5-generated explanations for given puns. Qualitative analysis shows that generated explanations often identify the relevant pun word, and can include somewhat accurate word sense information for one sense of the pun. However, the model usually fails to explain the alternate word sense and its relation, which is crucial to understanding the wordplay. The model especially fails to explain phonological similarity in heterographic puns; e.g., in the first three examples, explanations fail to mention alternate wordscarry, whet and humor. For both pun types, our model can devolve into repetitively copying words from the input. Our results exhibit the challenge of generating good pun explanations and that highquality explanations are useful for understanding humor. 4.1.3.2 Keyword-Conditioned Pun Generation The task of keyword-conditioned pun generation takes human-annotated pun keywords as input and produces novel puns as output. This benchmarks models’ capability to draw connections among words to generate novel fluent, sensible, and humorous texts. 
This is a challenging task with many downstream 75 applications, such as context-situated humor generation, a task that involves generating humorous text in a given situation or context. In this case, input keywords can come from conversational context (e.g., chatbot dialogues) or narrative context (e.g., creative short stories). More formally, we take as input keywords K, the pun word pw and alternate pun word aw, 8 and produce novel and fluent puns that incorporate the keywords. 9 Optionally, we also include pun word sense annotations Spw and Saw from the original SemEval 2017 Task 7 annotations. Data Preparation. For this task, we limit our data to samples that contain both (1) annotated human keywords K from ExPUN (AF6), and (2) pun word sense annotations Spw and Saw from SemEval 2017 Task 7. There are 1,482 such samples that have both annotations, from which we reserve 100 as test data and use the rest for model training. To construct input human-annotated keywords for this task, we aggregate keywords for each sample using the method described in Section 4.1.2.4. Additionally, we evaluate the effect of finetuning on automatically-extracted keywords instead of human-annotated keywords by automatically extracting keywords for each sample by running the RAKE [129] algorithm on the pun text. Evaluation Metrics. We use both automatic metrics and human evaluation to evaluate the quality of generated puns. For automatic evaluation, we calculate word incorporation rate for both pun words and keywords, which measure the model’s ability to incorporate all input keywords. Additionally, we run human evaluation using Amazon Mechanical Turk, in which we asked Turkers to label whether or not a given generated pun was successful. 10 Models. We use the following models: 8 pw = aw for homographic puns. 9We refer to “fluent puns” primarily in the context of the pun generation task, since generating fluent natural language realizations is often non-trivial, particularly in the case of controllable language generation tasks such as ours. 10Turkers had to pass a qualifier by correctly labeling >= 80% of 20 samples that we manually annotated. Success is defined as whether the text supports both senses of the pun word. We measure inter-annotator agreement among 3 annotators using Fleiss’ kappa (κ = 0.49), showing moderate agreement. 76 Key- Word Incorp. % Success words Model pw K both Rate % RAKE T5FT 90.0 76.4 80.2 35.0 T5PT+FT 99.0 72.9 81.2 54.0 ExPUN AmbiPun 99.0 92.1 94.4 51.0 T5FT 58.0 80.3 72.3 40.0 T5PT+FT 93.0 80.2 83.5 77.0 Table 4.7: Automatic (Word Incorporation Rate) and human evaluation (Success %) of puns generated by models finetuned using automatically-extracted (RAKE) and human-annotated (ExPUN) keywords (with AmbiPun baseline [96]). PT stands for Pre-Training and FT stands for Fine-Tuning. Both T5PT+FT models finetuned with RAKE-based keywords or ExPUN-based keywords use RAKE-based keywords during pretraining. AmbiPun mittal2022ambipun. We use the current state-of-the-art homographic pun generation model, AmbiPun, with no further finetuning. We follow the AmbiPun prompt format: “generate sentence: K, pw, aw”. Finetuned T5 (T5FT). We finetune T5-base on ExPUN using input prompt “generate a pun that situated in K, using the word pw, pw means Spw , aw means Saw .” The output is the pun itself. Finetuned T5 with pretraining (T5PT+FT). To increase the model’s ability to incorporate keywords, we pretrain T5 on non-pun text. 
For a given pun word, we first extract 200 sentences that contain the pun word from BookCorpus bookcorpus, then use RAKE to automatically extract keywords for each sentence. We construct examples where inputs are automatically extracted keywords, and outputs are sentences from BookCorpus including pun words. We pretrain a T5 model with this data before finetuning it on ExPUN. Results. Table 4.7 shows results of our pun generation models. While the AmbiPun baseline achieves superior word incorporation performance, our T5PT+FT model finetuned using ExPUN keywords generates successful puns at a higher rate, showing the value of training on our dataset. Furthermore, while pun word incorporation is improved by pretraining on outside sources using RAKE keywords, using automaticallyextracted keywords when training on in-domain pun text does not translate to more successful puns. 77 # pw, aw K Generated Pun 1 solution/ solution scientist, problem, liquid chemicals A liquid chemicals scientist has a problem with a solution. 2 makeup/ makeup class, beauty school A beauty school class was cancelled because of a lack of makeup. 3 charges/ charges farmer, bull The farmer, the bull, had to pay the charges. 4 fission/ fishing nuclear physicist, vacation, trip The nuclear physicist took a trip to the Bahamas for his fission vacation. 5 fare/ fair carnival, county The carnival in the county was a fare event. 6 vault/ fault bankers, generous OLD BACHERS never die they just become very generous. They have a vault fault. Table 4.8: Examples of input pun words and keywords and the resulting generated puns. We show examples of both homographic and heterographic generated puns. Instead, models finetuned with the more carefully-selected, human-annotated ExPUN keywords generate puns relatively more successfully than their RAKE-trained counterparts. Table 4.8 shows examples of generated puns from our ExPUN-T5PT+FT model. The model is able to generate both homographic and heterographic puns somewhat coherently using one of the pun word senses. However, while some puns are successful, Rows 3 and 6 show some ways our model can struggle to generate the respective pun types: it does not always incorporate the alternate word sense in a clever or meaningful way, and can stitch copied input keywords together into incoherent sentences. Our results show pun generation is a very challenging task, and that careful selection of pun keywords and a deeper understanding of humor in wordplay is essential for generating puns successfully. 78 4.1.4 Conclusion We contribute a dataset of extensive, high-quality annotations of humor explanation, keywords, and finegrained funniness ratings. This is the first humor dataset with such extensive and fine-grained annotations. Based on the annotations, we propose two tasks: pun explanation and keyword-conditioned pun generation, to challenge state-of-the-art natural language understanding and generation models’ ability to understand and generate humorous text. We benchmark several strong models’ performances on the two proposed tasks to validate the practical usage of the proposed annotations, and show that our humanannotated explanations and keywords are beneficial in understanding and generating humor. Future directions include a deeper analysis of how to characterize pun explanation more objectively within our annotation scheme, as well as further exploration of better models for both the pun explanation and pun generation tasks. 
4.2 DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback 11

4.2.1 Motivation and Introduction

Although we invite creative liberty when we commission art, we expect an artist to follow our instructions. Despite the advances in text-to-image (T2I) generation models [120, 128, 119, 130, 165], it remains challenging to obtain images that meticulously conform to users' intentions [106, 33, 123, 78, 86, 87, 124]. Current models often fail to compose multiple objects [33, 106, 86], bind attributes to the wrong objects [33], and struggle to generate visual text [87]. In fact, the difficulty of finding effective textual prompts has led to a myriad of websites and forums dedicated to collecting and sharing useful prompts (e.g., PromptHero, Arthub.ai, Reddit/StableDiffusion). There are also online marketplaces for purchasing and selling such useful prompts (e.g., PromptBase). The onus to generate aesthetic images that are faithful to a user's desires should lie with the model and not with the user.

11 Please refer to [57].

Figure 4.3: DreamSync. Given a prompt (e.g., "a little girl wearing a bright yellow dress and a copper crown is riding a badger through a field of flowers"), a text-to-image generation model generates multiple candidate images, which are evaluated by two VLMs: a VQA model that provides feedback on text faithfulness (via language-model-generated questions and answers) and an aesthetic model that scores visual appeal. The best images chosen by the VLMs are collected to fine-tune the T2I model with LoRA. This process can repeat indefinitely until convergence on feedback is achieved.

Today, there are efforts to address these challenges. For example, it is possible to manipulate attention maps based on linguistic structure to improve attribute-object binding [33, 123], or to train reward models using human feedback to better align generations with user intent [78, 32]. Unfortunately, these methods either operate on a specific model architecture [33, 123] or require expensive labeled human data [78, 32]. Worse, most of these methods sacrifice aesthetic appeal when optimizing for faithfulness, which we confirm in our experiments.

We introduce DreamSync, a model-agnostic framework that improves T2I generation faithfulness while maintaining aesthetic appeal. Our approach extends work on fine-tuning T2I models for alignment, but does not require any human feedback. The key insight behind DreamSync is to leverage the advances in vision-language models (VLMs), which can identify fine-grained discrepancies between the generated image and the user's input text [50, 22]. Intuitively, at a high level, our method can be thought of as a scalable version of reinforcement learning with human feedback (RLHF): just as LLaMA2 [149] was iteratively refined using human feedback, DreamSync improves T2I models using feedback from VLMs, except without the need for reinforcement learning.
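The loop in Figure 4.3 can be summarized schematically as follows; the remainder of this section walks through each stage in detail. This is a sketch only: t2i, vqa_score, aesthetic_score, and lora_finetune are placeholder helpers rather than the released implementation, and the hyperparameter values anticipate the choices reported in § 4.2.4.1.

def dreamsync(t2i, prompts, vqa_score, aesthetic_score, lora_finetune,
              k=8, theta_faithful=0.9, theta_aesthetic=0.6, iterations=3):
    # k, thresholds, and iteration count follow Section 4.2.4.1; helper functions are placeholders.
    for _ in range(iterations):
        finetune_set = []
        for prompt in prompts:
            images = [t2i(prompt) for _ in range(k)]                        # Sample
            scored = [(img, vqa_score(prompt, img), aesthetic_score(img))   # Evaluate
                      for img in images]
            passing = [(img, aes) for img, faith, aes in scored             # Filter
                       if faith >= theta_faithful and aes >= theta_aesthetic]
            if passing:                                                     # keep the most aesthetic survivor
                best_img, _ = max(passing, key=lambda pair: pair[1])
                finetune_set.append((prompt, best_img))
        t2i = lora_finetune(t2i, finetune_set)                              # Finetune with LoRA
    return t2i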
Given a set of textual prompts, the T2I model first generates multiple candidate images per prompt. DreamSync automatically evaluates these generated images using two VLMs. The first one measures the generation's faithfulness to the text [50, 22], while the second one measures aesthetic quality [68]. The best generations are collected and used to finetune the T2I model using parameter-efficient LoRA finetuning [49]. With the newly finetuned T2I model, we repeat the entire process for multiple iterations: generate images, curate a new finetuning set, and finetune again.

We conduct extensive experiments with the latest benchmarks and human evaluation. We apply DreamSync to two T2I models, SDXL [108] and SD v1.4 [115]. Results on both models show that DreamSync enhances the alignment of images to user inputs and retains their aesthetic quality. Specifically, quantitative results on TIFA [50] and DSG [22] demonstrate that DreamSync is more effective than all baseline alignment methods on SD v1.4, and can yield even bigger improvements on SDXL. Human evaluation on SDXL shows that DreamSync gives consistent improvements on all categories of alignment in DSG. While our study primarily focuses on boosting faithfulness and aesthetic quality, DreamSync has broader applications: it can be used to improve other characteristics of an image as long as there is an underlying model that can measure that characteristic.

4.2.2 DreamSync

Our method improves alignment and aesthetics in four steps (see Figure 4.3): Sample, Evaluate, Filter, and Finetune. The high-level idea is that T2I models are capable of generating interesting and varied samples. These samples are then judged by VLMs, and those that qualify as faithful and aesthetic candidates are used to further finetune the T2I model. We next dive into each component more formally.

Sample. Given a text prompt T, the text-to-image generation model G generates an image I = G(T). Generation models are randomized, and running G multiple times on the same prompt T can produce different images, which we index as {I^(k)}_{k=1}^{K}. To improve the model's faithfulness to text guidance, our method collects faithful examples generated by G. We use G to generate K samples for the same prompt T, so that with some probability δ > 0, a generated image I is faithful. Note that we need K = Ω(1/δ) samples for each prompt T, and DreamSync is not expected to improve totally unaligned models (with δ → 0). Prior work [66] estimates that 5–10 samples can yield a good image, and hence δ can be thought of as roughly 0.1 to 0.2.

Figure 4.4: Qualitative examples of DreamSync improving image-text alignment after each iteration (Stable Diffusion XL vs. DreamSync over Iterations 1–3, on prompts such as "A cube made of porcupine", "International Space Station flying in front of the moon", "A mountain stream with salmon leaping out of it", "Two leafs and two wallets", and "The eye of the planet Jupiter"). LoRA fine-tuning on generated and filtered prompt-image pairs can steer the model to gradually capture more components of the text inputs.

Evaluate. For each text prompt T, we derive a set of N_T question-answer pairs {Q(T), A(T)} that can be used to test whether a generated image I is faithful to T. We use an LLM to generate these pairs, using only the prompt T as input (with no images). Typically N_T ≈ 10. We use VQA models to evaluate the faithfulness of the generation model: F_j(T, I) = 1{VQA(I, Q_j(T)) = A_j(T)}, for j ∈ {1, . . . , N_T}.
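As a concrete illustration of the per-question check F_j, the sketch below uses BLIP-2 through the Hugging Face transformers library. The checkpoint name is an assumption (the text only specifies BLIP-2 as the VQA model), and exact string matching of answers is a simplification of how VQA outputs are compared in practice.

from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint; the chapter does not name a specific BLIP-2 variant.
MODEL_NAME = "Salesforce/blip2-flan-t5-xl"
processor = Blip2Processor.from_pretrained(MODEL_NAME)
vqa_model = Blip2ForConditionalGeneration.from_pretrained(MODEL_NAME)

def faithfulness_indicator(image, question, expected_answer):
    """F_j(T, I): 1 if the VQA answer to Q_j(T) matches A_j(T), else 0 (exact match is a simplification)."""
    inputs = processor(images=image, text=f"Question: {question} Answer:", return_tensors="pt")
    output_ids = vqa_model.generate(**inputs, max_new_tokens=10)
    predicted = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip().lower()
    return int(predicted == expected_answer.strip().lower())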
We measure the faithfulness of a caption-image pair (T, I) given all questions and answers, using two metrics. Intuitively, we can average the number of correct answers, or we can be stricter and only count an image as a success if all the answers are correct. Formally, the Mean score is the expected success rate

S_M(T, I) = (1/N_T) Σ_{j=1}^{N_T} F_j(T, I),

and the Absolute score is the absolute success rate

S_A(T, I) = Π_{j=1}^{N_T} F_j(T, I).

Filter. We combine text faithfulness and visual appeal (given by V(·)) as rewards for filtering. For a text prompt T and its corresponding synthetic image set {I_k}_{k=1}^{K}, we select samples that pass both the VQA and aesthetic filters:

C(T) = {(T, I_k) : S_M(T, I_k) ≥ θ_Faithful, V(I_k) ≥ θ_Aesthetic}.

To avoid an imbalanced distribution in which easy prompts contribute more samples, which could adversely affect image quality, we select one representative image (denoted Î_T) with the highest visual appeal for each T:

(T, Î_T) = argmax_{(T, I_k) ∈ C(T)} V(I_k).

We apply this procedure to all text prompts in our finetuning prompt set {T_i}_{i=1}^{N} with T_i ∼ D, where D is a prompt distribution. After filtering, we collect a subset of examples,

D(G) := ⋃_{i ∈ {j | C(T_j) ≠ ∅}} {(T_i, Î_{T_i})},

that meet our aesthetic and faithfulness criteria. Note that it is possible for C(T_i) to be empty, and we empirically show what fraction of the training data is selected in Figure 4.6. We ablate other aspects of the selection procedure in § 4.2.4.3.

Finetune. After obtaining a new subset of faithful and aesthetic text-image pairs, we fine-tune our generative model G on this set. We denote the generative model after s iterations of DreamSync as G_s, such that G_0 denotes the baseline model. To obtain G_{s+1}, we fine-tune on data generated by G_s after applying our filtering procedure as outlined above. We follow the same loss objective and fine-tuning dynamics as LoRA [49]. Let Θ(·) denote all parameters of a model; then the hypothesis class at iteration s is

𝒢_s = { G | rank(Θ(G) − Θ(G_s)) ≤ R },

where R denotes the rank of the weight updates; in practice we choose R = 128 to balance efficiency and image quality. Overall, the iterative training procedure is as follows:

G_{s+1} = argmin_{G ∈ 𝒢_s} (1/|D(G_s)|) Σ_{(T_j, I_j) ∈ D(G_s)} ℓ(G(T_j), I_j).    (4.1)

The self-training process in Equation 4.1 can in principle be executed indefinitely. In practice, it repeats for three iterations, at which point we observe diminishing returns.

4.2.3 Datasets and Evaluation

In this section, we introduce our training data in § 4.2.3.1 and evaluation benchmarks in § 4.2.3.2.

Figure 4.5: PaLM 2-generated training prompts and their corresponding images generated via DreamSync (e.g., "A cityscape with skyscrapers and flowers growing on the sides of the buildings"; "A dark gray cat wearing a multi colored scarf around its neck, sitting on a wall"; "A colorful anime illustration of a woman wearing a silver necklace, standing in a field of flowers, with a rainbow in the background"; "An intriguing photo of an old man sitting on a bench in the park, lit by the setting sun"). Prompt acquisition requires no human effort. It enables us to train on more complex and diversified prompt-image pairs than found in typical datasets.

4.2.3.1 Training Data Acquisition

To obtain prompts and corresponding question-answer pairs without a human in the loop, we utilize the in-context learning capability of Large Language Models (LLMs). We choose PaLM 2 [6] (https://ai.google/discover/palm2/) as our LLM and proceed as follows:

1. Prompt Generation.
We provide five hand-crafted seed prompts as examples and then ask PaLM 2 to generate similar textual prompts. We include additional instructions that specify the prompt length, a category (randomly drawn from twelve desired categories as in [50], e.g., spatial, counting, food, animal/human, activity), no repetition, etc. We change the seed prompts and repeat the prompt generation three times.

2. QA Generation.
Given the prompts, we then use PaLM 2 again to generate question-answer pairs that we will use as input for VQA models, as in TIFA [50].

3. Filtering.
We finally use PaLM 2 once more to filter out unanswerable QA pairs. Here our instruction aims to identify three scenarios: the question has multiple answers (e.g., "black and white panda", where the object has multiple colors and each color could be the answer), the answer is ambiguous (e.g., "a lot of people"), or the answer is not valid for the question.

Table 4.9: Benchmark on Text Faithfulness and Visual Appeal. All models are sampled with the same set of four seeds, i.e., K = 4. The first three columns (TIFA Mean, TIFA Absolute, DSG1K) measure text faithfulness; the last column measures visual appeal. Gains and losses compared to the base models are shown in parentheses. DreamSync significantly improves SDXL and SD v1.4 in alignment and visual appeal across all benchmarks; additionally, it does not sacrifice image quality when improving faithfulness.

Model / Alignment                          TIFA Mean     TIFA Absolute   DSG1K         Visual Appeal
SD v1.4 [115], no alignment                76.6          33.6            72.0          44.6
  SynGen [123] (training-free)             76.8 (+0.2)   34.1 (+0.5)     71.2 (–0.8)   42.4 (–2.2)
  StructureDiffusion [33] (training-free)  76.5 (–0.1)   33.6 (+0.0)     71.9 (–0.1)   41.5 (–3.1)
  DPOK [32] (RL)                           76.4 (–0.2)   33.8 (+0.2)     70.3 (–1.7)   46.5 (+1.9)
  DDPO [8] (RL)                            76.7 (+0.1)   34.4 (+0.8)     70.0 (–2.0)   43.5 (–1.1)
  DreamSync (ours)                         77.6 (+1.0)   35.3 (+1.7)     73.2 (+1.2)   44.9 (+0.3)
SDXL [108], no alignment                   83.5          45.5            83.4          60.9
  DreamSync (ours)                         85.2 (+1.7)   49.2 (+3.7)     86.3 (+2.9)   64.3 (+3.4)

4.2.3.2 Evaluation Benchmarks

Using the previously generated prompts, we evaluate whether DreamSync can improve T2I model performance on benchmarks that include general prompts. We consider the following benchmarks.

TIFA. To evaluate the faithfulness of the generated images to the textual input, TIFA [50] uses VQA models to check whether, given a generated image, questions about its content are answered correctly. There are 4k diverse prompts and 25k questions spread across 12 categories in the TIFA benchmark. Although there is no overlap between our training data and TIFA, we use the TIFA attributes to constrain our LLM-based prompt generation. Therefore, we use TIFA to test DreamSync on in-distribution prompts. We follow TIFA and use BLIP-2 as the VQA model for evaluation.

Davidsonian Scene Graph (DSG). DSG [22] follows the same VQA-as-evaluator insight as TIFA and further improves its reliability. Specifically, DSG ensures that all questions are atomic, distinct, unambiguous, and valid. To comprehensively evaluate T2I images, DSG provides 1,060 prompts covering many concepts and writing styles from different datasets that are completely independent of DreamSync's training data acquisition stage. Not only is DSG a strong T2I benchmark, it also enables further analysis of DreamSync with out-of-distribution prompts. Furthermore, DSG uses PaLI as the VQA model for evaluation, which is different from the VQA model that we use in training (i.e., BLIP-2) and lifts the concern of VQA model bias in evaluation. We use DSG QA both automatically (with PaLI) and with human raters.
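A minimal sketch of the three-step data acquisition procedure in § 4.2.3.1 is given below. It is illustrative only: the llm callable stands in for PaLM 2, the seed prompts and instruction wording are assumptions rather than the ones used in the chapter, and the JSON-parsing step assumes the model returns well-formed JSON.

import json

# Illustrative seed prompts; the actual five hand-crafted seeds are not given in the text.
SEED_PROMPTS = [
    "a red bicycle leaning against a brick wall at sunset",
    "two corgis wearing party hats in a small kitchen",
]

def build_training_data(llm, category, n_prompts=20):
    """llm: any text-in/text-out callable standing in for PaLM 2."""
    # Step 1: prompt generation via in-context examples.
    prompt_instruction = (
        "Example text-to-image prompts:\n" + "\n".join(SEED_PROMPTS) + "\n"
        f"Write {n_prompts} new, non-repeating prompts of 10-20 words about '{category}'."
    )
    prompts = [p.strip("- ").strip() for p in llm(prompt_instruction).splitlines() if p.strip()]

    data = []
    for p in prompts:
        # Step 2: QA generation for the prompt (no image is used).
        qa_prompt = (
            "For the image description below, write question-answer pairs that verify each detail, "
            'as a JSON list of {"question": ..., "answer": ...} objects.\n'
            f"Description: {p}"
        )
        qa_pairs = json.loads(llm(qa_prompt))
        # Step 3: filter out QA pairs whose answers are multiple, ambiguous, or invalid.
        keep = [qa for qa in qa_pairs if llm(
            f"Question: {qa['question']}\nAnswer: {qa['answer']}\n"
            "Does this question have a single, unambiguous, valid answer? Reply yes or no."
        ).strip().lower().startswith("yes")]
        data.append({"prompt": p, "qa": keep})
    return data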
4.2.4 Experiments

We explain our experimental setup in § 4.2.4.1 and showcase the efficacy of training with DreamSync, comparing against other methods, in § 4.2.4.2. § 4.2.4.3 analyzes our choice of rewards; § 4.2.4.4 reports results for a human study.

4.2.4.1 Experimental Set-up

Base Model. We evaluate DreamSync on Stable Diffusion v1.4 [115], which is also used in related work. Additionally, we consider SDXL [108], which is the current state-of-the-art open-sourced T2I model. For each prompt, we generate eight images, i.e., K = 8.

Fine-grained VLM Feedback. We use feedback from two VLMs to decide which text-image pairs to keep for finetuning. We use BLIP-2 [82] as the VQA model to measure the faithfulness of generated images to the textual input, and VILA [68] to measure the aesthetics score. Empirically, we keep the text-image pairs whose VQA scores are greater than θ_Faithful = 0.9 and whose aesthetics scores are greater than θ_Aesthetic = 0.6. If multiple generated images pass the thresholds, we keep the one with the highest VILA score. Starting from 28,250 prompts, we find that more than 25% of the prompts are kept for D(G_0) (for both T2I models), which we use for finetuning. We later show that this percentage increases further as we perform additional DreamSync iterations.

Figure 4.6: DreamSync improves faithfulness and aesthetics iteratively. More examples pass the filters with additional iterations (the fraction of examples passing the filters ranges from roughly 26.6% to 30.9% across the three iterations).

Baselines. We compare DreamSync with two types of methods that improve the faithfulness of T2I models: two training-free methods (StructureDiffusion [33] and SynGen [123]) and two RL-based methods (DPOK [32] and DDPO [8]). As the baselines use SD v1.4 as their backbone, we also use it with DreamSync for a fair comparison.

4.2.4.2 Benchmark Results

In Table 4.9 we compare DreamSync to various state-of-the-art approaches with four random seeds.

DreamSync Improves the Alignment and Aesthetics of both SDXL and SD v1.4. For SDXL [108], three iterations of DreamSync improve the generation faithfulness by 1.7 points of mean score and 3.7 points of absolute score on TIFA. The visual aesthetic score after performing DreamSync improves by 3.4 points. Due to its model-agnostic nature, it is straightforward to apply DreamSync to different T2I models. We also apply DreamSync to SD v1.4 [115]. DreamSync improves faithfulness by 1.0 points of mean score and 1.7 points of absolute score on TIFA, together with a 0.3-point VILA score improvement for aesthetics. Most prominently, on DSG1K, DreamSync improves the text faithfulness of SDXL by 2.9 points. DreamSync yields the best performance in terms of textual faithfulness on TIFA and DSG, without sacrificing visual appearance, as shown in Table 4.9. In Figure 4.6 we report TIFA and aesthetics scores for each iteration, where we observe how DreamSync gradually improves the alignment and aesthetics of the generated images. We highlight several qualitative examples in Figure 4.4.

Table 4.10: Ablation of different VLM rewards. Models are evaluated after one iteration.

VQA reward   VILA reward   Text Faithfulness   Visual Appeal
-            -             83.5                60.9
✓            -             84.8                61.9
-            ✓             83.8                61.7
✓            ✓             84.7                62.8

4.2.4.3 Analysis & Ablations

Impact of the VQA model on evaluation. We analyze whether using BLIP-2 as the VQA model both for finetuning and for evaluation in TIFA might be the reason for the improvement by DreamSync that we have observed.
To test this, we use PaLI [19] to replace BLIP-2 as the VQA model in TIFA. Using SDXL as the backbone, DreamSync improves the mean score on TIFA from 90.09 to 92.02 compared to the vanilla SDXL model. This result confirms that DreamSync is in fact able to improve the textual faithfulness of T2I models.

Ablating the Reward Models. In Table 4.10, we present the results of an ablation study where we remove one of the VLMs during filtering and evaluate SDXL after applying one iteration of DreamSync. It can be seen that training with a single pillar mainly leads to an improvement in the corresponding metric, while the combination of the two VLMs leads to strong performance for both text faithfulness and visual aesthetics, justifying our approach. One interesting finding is that training with both rewards, rather than VILA only, gives the highest visual appeal score. A possible explanation is that images that align with user inputs may have higher visual appeal.

ImageReward. We next test whether DreamSync yields an improvement on human preference reward models, even though DreamSync is not trained to optimize them. We use ImageReward [159] as an off-the-shelf human preference model for generated images. Table 4.11 shows that DreamSync plus either SD v1.4 or SDXL increases ImageReward scores on images based on both TIFA and DSG1K. Tuning with VLM-based feedback helps align the generated images with human preferences, at least according to ImageReward.

Table 4.11: Scores given by the human preference model ImageReward [159]; model scores are logits and can be negative. Models trained with DreamSync outperform the other baselines (higher is better), without using any human annotation.

T2I Model   Alignment Method      TIFA     DSG1K
SD v1.4     No alignment          0.056    -0.220
            SynGen                0.149    -0.237
            StructureDiffusion    0.075    -0.135
            DPOK                  0.067    -0.258
            DDPO                  0.152    -0.076
            DreamSync (ours)      0.168    -0.054
SDXL        No alignment          0.878    0.702
            DreamSync (ours)      1.020    0.837

4.2.4.4 Human Evaluation

We conduct a user study based on DSG [22], where we ask external raters approximately 8 fine-grained questions for each of 1,060 images. These questions are divided into categories (entity, attribute, relation, global). In Figure 4.7, we observe consistent and statistically significant improvements comparing DreamSync to SDXL. In each category, images from DreamSync contain more components of the prompts, while excluding extraneous features. Overall, DreamSync's images led to 3.4 points more correct answers than SDXL images, from 70.9% to 74.3%.

Figure 4.7: Human study with three raters on 1,060 DSG prompts. Per-category accuracy for DreamSync vs. SDXL: entity 0.705 vs. 0.689, attribute 0.767 vs. 0.720, relation 0.741 vs. 0.697, global 0.903 vs. 0.884.

4.2.5 Conclusion

We introduce DreamSync, a versatile framework to improve text-to-image (T2I) synthesis with feedback from image understanding models. Our dual VLM feedback mechanism helps with both the alignment of images to the textual input and the aesthetic quality of the generated images. Through evaluations on two challenging T2I benchmarks (with over five thousand prompts), we demonstrate that DreamSync can improve both SD v1.4 and SDXL in both alignment and visual appeal. The benchmarks also show that DreamSync performs well in both in-distribution and out-of-distribution settings. Furthermore, human ratings and a human preference prediction model largely agree with DreamSync's improvement on benchmark datasets.
For future work, one direction is to ground the feedback mechanism to give fine-grained annotations (e.g., bounding boxes to point out where in the image the misalignment lies). Another direction is to tailor the prompts used at each iteration of DreamSync to target different improvements: backpropagating VLM feedbacks to the prompt acquisition pipelines for continual learning. 92 Chapter 5 Conclusions This dissertation has explored the complexities and transformative impacts of Large Language Models (LLMs), uncovering their limitations and biases despite their potential. We highlight the critical need for fair and robust evaluation methods to measure LLM performance comprehensively. These metrics serve a dual purpose: they not only reveal where LLMs fall short but also guide efforts to refine these models for better alignment with human intentions. A thorough analysis of the data quality in LLM development, spanning both pretraining and finetuning phases, has highlighted how data integrity directly influences model outcomes. This insight has led to the proposal of rigorous data acquisition strategies aimed at improving model fairness and efficacy. We introduce various approaches to data acquisition, from pioneering human annotation tasks to the strategic generation of synthetic data, driven by AI feedback. These approaches demonstrate a clear path toward producing LLMs that are accurate and equitable. By advocating for the adoption of fair and robust evaluation metrics and deliberate data enhancement techniques, this dissertation lays the groundwork for the development of LLMs that are not only technologically advanced but also well aligned with human intentions. In summary, this dissertation represents a step towards the responsible development of LLMs, advocating the development of models that align with human values through the integration of advanced 93 evaluation metrics and intelligent data acquisition methods. It sets a precedent for future research and application in the field, encouraging ongoing efforts to ensure that the evolution of LLMs adheres to the highest standards of ethical integrity and societal benefit. 94 Bibliography [1] Arshiya Aggarwal, Jiao, Sun*, and Nanyun Peng. “Towards Robust NLG Bias Evaluation with Syntactically-diverse Prompts”. In: Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 6022–6032. doi: 10.18653/v1/2022.findings-emnlp.445. [2] Shourya Aggarwal, Divyanshu Mandowara, Vishwajeet Agrawal, Dinesh Khandelwal, Parag Singla, and Dinesh Garg. “Explanations for CommonsenseQA: New Dataset and Models”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 3050–3065. doi: 10.18653/v1/2021.acl-long.238. [3] David Ahn. “The stages of event extraction”. In: Proceedings of the Workshop on Annotating and Reasoning about Time and Events. 2006, pp. 1–8. [4] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. “Falcon-40B: an open large language model with state-of-the-art performance”. In: (2023). [5] David Alvarez-Melis and T. Jaakkola. 
“Towards Robust Interpretability with Self-Explaining Neural Networks”. In: NeurIPS. 2018. url: https: //proceedings.neurips.cc/paper/2018/hash/3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.html. [6] Rohan Anil et al. PaLM 2 Technical Report. 2023. arXiv: 2305.10403 [cs.CL]. [7] R Harald Baayen. “Mixed-effects models”. In: The Oxford handbook of laboratory phonology (2012), pp. 668–677. [8] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training Diffusion Models with Reinforcement Learning. 2023. arXiv: 2305.13301 [cs.LG]. 95 [9] Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, and Slav Petrov. “Natural Language Processing with Small Feed-Forward Networks”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, Sept. 2017, pp. 2879–2885. doi: 10.18653/v1/D17-1309. [10] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. 2020. arXiv: 2005.14165 [cs.CL]. [11] Karyn Buxman. “Humor in the OR: A Stitch in Time?” In: AORN journal 88.1 (2008), pp. 67–77. [12] Cansen Çağlayan and Murat Karakaya. “Topic-Controlled Text Generation”. In: 2021 6th International Conference on Computer Science and Engineering (UBMK). 2021, pp. 533–536. doi: 10.1109/UBMK52708.2021.9558910. [13] A. Caliskan, J. Bryson, and A. Narayanan. “Semantics derived automatically from language corpora contain human-like biases”. In: Science 356 (2017), pp. 183–186. [14] Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. “e-SNLI: Natural Language Inference with Natural Language Explanations”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Curran Associates, Inc., 2018, pp. 9539–9549. url: http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-languageexplanations.pdf. [15] Santiago Castro, Luis Chiruzzo, Aiala Rosá, Diego Garat, and Guillermo Moncecchi. “A Crowd-Annotated Spanish Corpus for Humor Analysis”. In: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media. Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 7–11. doi: 10.18653/v1/W18-3502. [16] Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. “Evaluation of Text Generation: A Survey”. In: CoRR abs/2006.14799 (2020). arXiv: 2006.14799. url: https://arxiv.org/abs/2006.14799. [17] J.K. Chambers, P. Trudgill, and S.R. Anderson. Dialectology. Cambridge Textbooks in Linguistics. Cambridge University Press, 1998. isbn: 9780521596466. url: https://books.google.com/books?id=9bYV43UhKssC. [18] Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. “A Multi-Task Approach for Disentangling Syntax and Semantics in Sentence Representations”. In: Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 2453–2464. 96 [19] Xi Chen, Xiao Wang, Soravit Changpinyo, A. J. Piergiovanni, Piotr Padlewski, Daniel M. 
Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V. Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. “PaLI: A Jointly-Scaled Multilingual Language-Image Model”. In: ArXiv abs/2209.06794 (2022). [20] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. Mar. 2023. url: https://lmsys.org/blog/2023-03-30-vicuna/. [21] Delia Chiaro. The language of jokes: Analyzing verbal play. Routledge, 2006. [22] Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation. 2023. arXiv: 2310.18235 [cs.CV]. [23] Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. “Rethinking Embedding Coupling in Pre-trained Language Models”. In: International Conference on Learning Representations. 2021. url: https://openreview.net/forum?id=xpFFI_NtgpW. [24] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling Instruction-Finetuned Language Models. 2022. doi: 10.48550/ARXIV.2210.11416. [25] Miruna-Adriana Clinciu, Arash Eshghi, and Helen Hastie. “A Study of Automatic Metrics for the Evaluation of Natural Language Explanations”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics, Apr. 2021, pp. 2376–2387. doi: 10.18653/v1/2021.eacl-main.202. [26] Aaron Daniel Cohen, Adam Roberts, Alejandra Molina, Alena Butryna, Alicia Jin, Apoorv Kulshreshtha, Ben Hutchinson, Ben Zevenbergen, Blaise Hilary Aguera-Arcas, Chung-ching Chang, Claire Cui, Cosmo Du, Daniel De Freitas Adiwardana, Dehao Chen, Dmitry (Dima) Lepikhin, Ed H. Chi, Erin Hoffman-John, Heng-Tze Cheng, Hongrae Lee, Igor Krivokon, James Qin, Jamie Hall, Joe Fenton, Johnny Soraker, Kathy Meier-Hellstern, Kristen Olson, Lora Mois Aroyo, Maarten Paul Bosma, Marc Joseph Pickett, Marcelo Amorim Menegali, Marian Croak, Mark Díaz, Matthew Lamm, Maxim Krikun, Meredith Ringel Morris, Noam Shazeer, Quoc V. Le, Rachel Bernstein, Ravi Rajakumar, Ray Kurzweil, Romal Thoppilan, Steven Zheng, Taylor Bos, Toju Duke, Tulsee Doshi, Vinodkumar Prabhakaran, Will Rusch, YaGuang Li, Yanping Huang, Yanqi Zhou, Yuanzhong Xu, and Zhifeng Chen. “LaMDA: Language Models for Dialog Applications”. In: arXiv. 2022. 97 [27] Dorottya Demszky, Devyani Sharma, Jonathan Clark, Vinodkumar Prabhakaran, and Jacob Eisenstein. “Learning to Recognize Dialect Features”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 
Online: Association for Computational Linguistics, June 2021, pp. 2315–2338. doi: 10.18653/v1/2021.naacl-main.184. [28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. [29] Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. “Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets”. In: Transactions of the Association for Computational Linguistics 5 (2017), pp. 471–486. doi: 10.1162/tacl_a_00074. [30] Daan van Esch, Tamar Lucassen, Sebastian Ruder, Isaac Caswell, and Clara Rivera. “Writing system and speaker metadata for 2,800+ language varieties”. In: Proceedings of LREC. 2022. [31] Angela Fan, Mike Lewis, and Yann Dauphin. “Hierarchical Neural Story Generation”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 889–898. doi: 10.18653/v1/P18-1082. [32] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. “DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models”. In: arXiv preprint arXiv:2305.16381 (2023). [33] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. 2023. arXiv: 2212.05032 [cs.CV]. [34] Besnik Fetahu, Katja Markert, and Avishek Anand. “Automated news suggestions for populating wikipedia entity pages”. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 2015, pp. 323–332. [35] Joseph L. Fleiss and Jacob Cohen. “The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability”. In: Educational and Psychological Measurement 33 (1973), pp. 613–619. [36] Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. “Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation”. In: Transactions of the Association for Computational Linguistics 9 (2021), pp. 1460–1474. doi: 10.1162/tacl_a_00437. [37] Silin Gao, Yichi Zhang, Zhijian Ou, and Zhou Yu. “Paraphrase Augmented Task-Oriented Dialog Generation”. In: ArXiv abs/2004.07462 (2020). 98 [38] Tianyu Gao, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 6894–6910. doi: 10.18653/v1/2021.emnlp-main.552. [39] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. “Shortcut learning in deep neural networks”. In: Nature Machine Intelligence 2.11 (Nov. 2020), pp. 665–673. issn: 2522-5839. doi: 10.1038/s42256-020-00257-z. [40] Joseph L. Gerken. “How Courts Use Wikipedia”. 
In: The Journal of Appellate Practice and Process 11 (2010), p. 191. [41] Alejandro Hallo-Carrasco, Benjamin F Gruenbaum, and Shaun E Gruenbaum. “Heat and Moisture Exchanger Occlusion Leading to Sudden Increased Airway Pressure: A Case Report Using ChatGPT as a Personal Writing Assistant”. In: Cureus 15.4 (2023). [42] Rujun Han, Qiang Ning, and Nanyun Peng. “Joint Event and Temporal Relation Extraction with Shared Representations and Structured Prediction”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 434–444. [43] Md Kamrul Hasan, Wasifur Rahman, AmirAli Bagher Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, and Mohammed (Ehsan) Hoque. “UR-FUNNY: A Multimodal Language Dataset for Understanding Humor”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 2046–2056. doi: 10.18653/v1/D19-1211. [44] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. “DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION”. In: International Conference on Learning Representations. 2021. url: https://openreview.net/forum?id=XPZIaotutsD. [45] John Hewitt, Christopher Manning, and Percy Liang. “Truncation Sampling as Language Model Desmoothing”. In: Findings of the Association for Computational Linguistics: EMNLP 2022. Ed. by Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 3414–3427. doi: 10.18653/v1/2022.findings-emnlp.249. [46] Fred Hohman, Haekyu Park, Caleb Robinson, and Duen Horng Chau. “Summit: Scaling Deep Learning Interpretability by Visualizing Activation and Attribution Summarizations”. In: IEEE Transactions on Visualization and Computer Graphics 26 (2020), pp. 1096–1106. [47] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The Curious Case of Neural Text Degeneration. 2020. arXiv: 1904.09751 [cs.CL]. [48] Matthew Honnibal and Ines Montani. “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing”. To appear. 2017. 99 [49] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. “LoRA: Low-Rank Adaptation of Large Language Models”. In: International Conference on Learning Representations. 2022. url: https://openreview.net/forum?id=nZeVKeeFYf9. [50] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. “Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering”. In: arXiv preprint arXiv:2303.11897 (2023). [51] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. “T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation”. In: arXiv preprint arXiv:2307.06350 (2023). [52] Kuan-Hao Huang and Kai-Wei Chang. “Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs”. In: ArXiv abs/2101.10579 (2021). [53] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. “Adversarial Example Generation with Syntactically Controlled Paraphrase Networks”. 
In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Ed. by Marilyn Walker, Heng Ji, and Amanda Stent. New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 1875–1885. doi: 10.18653/v1/N18-1170. [54] Alon Jacovi and Yoav Goldberg. “Aligning Faithful Interpretations with their Social Attribution”. In: Transactions of the Association for Computational Linguistics 9 (2021), pp. 294–310. doi: 10.1162/tacl_a_00367. [55] Alon Jacovi and Yoav Goldberg. “Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?” In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, July 2020, pp. 4198–4205. doi: 10.18653/v1/2020.acl-main.386. [56] Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine. 2023. arXiv: 2301.08745 [cs.CL]. [57] Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, and Cyrus Rashtchian. “Aligning Text-to-Image Generation with Image Understanding Feedback”. In: url: https://arxiv.org/abs/2311.17946. [58] Jiao Sun, Yu Hou, Jiin Kim, and Nanyun Peng. Helpfulness and Fairness of Task-Oriented Dialogue Systems. 2023. arXiv: 2205.12554 [cs.CL]. [59] Jiao Sun, Q. Vera Liao, Michael Muller, Mayank Agarwal, Stephanie Houde, Kartik Talamadupula, and Justin D. Weisz. “Investigating Explainability of Generative AI for Code through Scenario-Based Design”. In: 27th International Conference on Intelligent User Interfaces. IUI ’22. Helsinki, Finland: Association for Computing Machinery, 2022, pp. 212–228. isbn: 9781450391443. doi: 10.1145/3490099.3511119. 100 [60] Jiao Sun, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Tagyoung Chung, Jing Huang, Yang Liu, and Nanyun Peng. “ExPUNations: Augmenting Puns with Keywords and Explanations”. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 4590–4605. doi: 10.18653/v1/2022.emnlp-main.304. [61] Jiao Sun and Nanyun Peng. “Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 350–360. doi: 10.18653/v1/2021.acl-short.45. [62] Jiao Sun, Thibault Sellam, Elizabeth Clark, Tu Vu, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, and Sebastian Gehrmann. “Dialect-robust Evaluation of Generated Text”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics, July 2023, pp. 6010–6028. doi: 10.18653/v1/2023.acl-long.331. [63] Jiao Sun, Swabha Swayamdipta, Jonathan May, and Xuezhe Ma. “Investigating the Benefits of Free-Form Rationales”. In: Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 5867–5882. doi: 10.18653/v1/2022.findings-emnlp.432. 
Abstract
Large Language Models (LLMs) are powerful and have revolutionized areas such as language understanding and generation. Because of their broad impact, it is difficult to systematically evaluate where these models make mistakes or underperform, which creates a pressing need for careful and nuanced evaluation methods. Equipped with the reliable evaluation metrics we develop, we find that existing LLMs are far from perfect and can be biased towards certain demographic groups. Digging deeper into the development cycle, we find that the quality of both pretraining and finetuning data heavily impacts LLM performance (e.g., fairness and alignment), so ensuring data quality throughout LLM development is essential. To obtain high-quality data, this thesis covers two approaches: human annotation and careful synthetic data generation. For human annotation, we either define new tasks and work with crowd workers to ensure high annotator agreement and data quality, or conduct strategic data acquisition, including scraping high-quality content, to obtain targeted data for model training. For synthetic data generation, we rely on feedback from additional AI models to select good-quality samples and improve model quality iteratively. The overarching goal of this thesis is to advance responsible LLM development by building robust evaluation metrics and developing smart data acquisition techniques. Ultimately, this work aims to ensure alignment with human values and needs in the evolving landscape of artificial intelligence.
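As a minimal sketch of the synthetic-data loop summarized above (assuming hypothetical model, reward-model, and fine-tuning interfaces rather than the actual pipeline used in this thesis), the select-then-refine idea can be illustrated in Python as follows:

    # Minimal illustrative sketch of AI-feedback-based selection of synthetic data.
    # The objects `model` and `reward_model`, and their methods (generate, score,
    # fine_tune), are hypothetical placeholders, not APIs from any specific library.

    def generate_candidates(model, prompts, n_per_prompt=4):
        """Sample several candidate responses per prompt from the current model."""
        return [(p, model.generate(p)) for p in prompts for _ in range(n_per_prompt)]

    def select_high_quality(candidates, reward_model, threshold=0.8):
        """Keep only samples that an auxiliary AI judge scores above a threshold."""
        return [(p, r) for (p, r) in candidates if reward_model.score(p, r) >= threshold]

    def iterative_refinement(model, reward_model, prompts, rounds=3):
        """Alternate between generating synthetic data, filtering it with AI
        feedback, and fine-tuning the model on the retained samples."""
        for _ in range(rounds):
            candidates = generate_candidates(model, prompts)
            curated = select_high_quality(candidates, reward_model)
            model = model.fine_tune(curated)  # train only on the filtered samples
        return model

Each round generates candidate samples, keeps only those the auxiliary AI judge rates highly, and fine-tunes on the retained subset, so the training data and the model improve together.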