Responsible Artificial Intelligence for a Complex World by Ninareh Mehrabi A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) December 2022 Copyright 2023 Ninareh Mehrabi To my mom, and all the women who faced discrimination. ii Acknowledgments First and foremost, I want to thank my advisor, Aram Galstyan, who not only guided me on how to conduct research and be a successful PhD student, but was also a great mentor outside the academic world for which I am always grateful. Secondly, I want to thank my co-advisor Fred Morstatter who always believed in my ideas from day one! I also want to thank Ahmad Beirami who although was not officially my advisor, I consider him as one of my advisors who always supported and guided me through various stages of my PhD journey. I want to thank all my supportive friends including but not limited to Sarik Ghazarian who was like a sister to me and Palash Goyal who was my role model. In addition, I thank all my collaborators: Nanyun Peng, Kristina Lerman, Ahmad Beirami, Xiang Ren, Muhammad Naveed, Jay Pujara, Greg Ver Steeg, Thamme Gowda, Umang Gupta, Pei Zhou, and Yuzhong Huang. Research can not be performed alone and without the help of these people doing research would not have been possible. I also thank Mozhdeh Gheini who was my go to PhD Wikipedia and helped me with many different questions. I am thankful to the committee members: Aram Galstyan, Fred Morstatter, Kristina Lerman, Morteza Dehghani, Aiichiro Nakano. I am also grateful for all the help that Peter Zamar from ISI and Lizsl De Leon from USC provided to me. I also thank my managers and mentors from Amazon who hosted me as a summer intern to grow during my PhD journey. In addition, I thank USC + Amazon Center for Secure and Trusted Machine Learning for providing me with the fellowship and supporting a portion of my PhD studies and research. I finally thank my mother who always supported me through hardships and happiness, and all the women who fight for freedom to turn sadness and dark nights into happiness and bright days. I dedicate my thesis to these women who inspire me and aim to hold the world responsible for all the existing irresponsibilities in it. iii TableofContents Dedication ii Acknowledgments iii ListofTables vii ListofFigures ix Abstract xii Chapter1: Introduction 1 Chapter2: Background: OverviewofConcepts 5 2.1 Bias and Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Preliminary Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Fairness Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Fairness Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.4 Benchmark Datasets Widely Used . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Data Poisoning Attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter3: IdentifyingandMitigatingBiastoImproveFairness 9 3.1 Bias in Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1.1 Bias in Commonsense Knowledge Bases . . . . . . . . . . . . . . . . . . . 10 3.1.2 Effects on Downstream Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.1.3 Mitigation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
13 3.2 Bias in Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.1 Exclusion Bias in Community Detection Algorithms . . . . . . . . . . . . 15 3.2.2 Mitigation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 Bias in Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.1 Gender Bias in Named Entity Recognition Systems . . . . . . . . . . . . . 19 3.3.2 Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Chapter4: AnalyzingandImprovingRobustnesstoSocietalConcerns 26 4.1 Robustness to Unfairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.1.1 Attacks on Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 iv 4.1.2 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Robustness to Toxicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2.1 Attacks on Toxicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.2.2 Attack Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2.3 Defenses on Toxicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2.4 Defense Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Chapter5: InterpretabilityasaMeanstoImproveFairness 43 5.1 Problematic Feature Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.2 Bias Mitigation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.3.1 Validating Attribution Framework . . . . . . . . . . . . . . . . . . . . . . 46 5.3.2 Validating Bias Mitigation Strategy . . . . . . . . . . . . . . . . . . . . . . 48 5.3.3 Validating Applicability on Textual Data . . . . . . . . . . . . . . . . . . . 50 5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Chapter6: ResolvingAmbiguitythroughHumanFeedback 52 6.1 Benchmark Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.2 Disambiguation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.2.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.3 Evaluation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Chapter7: InternalizingEthicsinAISystems 62 7.1 Broader Impact Dataset (BID) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 7.1.1 BID-Gen: A Dataset for Generating Broader Impact Statements . . . . . . 63 7.1.2 BID-QA: A Dataset for Answering Ethical Questions . . . . . . . . . . . . 63 7.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
65 7.2.1 Generating Hypothesis (Task 1) . . . . . . . . . . . . . . . . . . . . . . . . 66 7.2.1.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 7.2.2 Answering Ethical Questions (Task 2) . . . . . . . . . . . . . . . . . . . . 67 7.2.2.1 Standard Experiments . . . . . . . . . . . . . . . . . . . . . . . . 68 7.2.2.2 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . 68 7.2.2.3 Results on Standard Experiments . . . . . . . . . . . . . . . . . 69 7.2.2.4 Results on Additional Experiments . . . . . . . . . . . . . . . . 69 7.3 Multitasking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 v Chapter8: RelatedWork 72 8.1 Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 8.1.1 Fairness Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 8.1.2 Methods for Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 8.2 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 8.2.1 Robustness in Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 74 8.2.2 Robustness in Natural Language Processing . . . . . . . . . . . . . . . . . 74 8.3 Interpretability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 8.4 Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Chapter9: Discussion 76 9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 9.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Bibliography 79 vi ListofTables 3.1 Mitigation results from the pre-processing filtering technique (COMeT-Filtered) compared to standard COMeT. COMeT-Filtered is effective at reducing overgeneralization and disparity according to sentiment and regard measures along with human evaluation results. Despite the effectiveness of the filtering approach in reducing bias, the quality of generated triples are reduced compared to COMeT. Hence, the quality of generation is compromised over reduction in bias. 14 3.2 Counts along with examples from information that is missed by two traditional community detection algorithms demonstrating the effect of the exclusion bias. . 16 3.3 Results obtained from calculating F1 and Jaccard similarity scores with regards to the ground truth labels for each of the algorithms along with the number of significantly labeled users indicating the effect on mitigating exclusion bias. . . . 19 3.4 Percentage of female and male names from the census data appearing in CoNLL 2003 and OntoNotes datasets with their corresponding counts. Both datasets fail to reflect the variety of female names. . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.1 Difference of the True Positive Rates (TPRD) amongst different genders for the dentist and nurse occupations on the biosbias dataset. Our introduced post-processing method is the most effective in reducing the disparity for both occupations compared to the pre-processing technique. . . . . . . . . . . . . . . . 50 6.1 Breakdown of our benchmark dataset by ambiguity types. The benchmark consists of six types of ambiguities, including linguistic and fairness. We cover syntactic as well as discourse type ambiguities for linguistic type ambiguities. 
It also contains complex version for subset of the samples from the main type ambiguities with structurally more complex sentences, combination cases that combine fairness and linguistic type ambiguities, and some miscellaneous cases. . 54 6.2 BLEU and ROUGE scores obtained by different LMs on generating a clarifying question in 6-shot setup given an ambiguous prompt.↑ indicates that higher values are desired. Scores are reported on a 0-1 scale. . . . . . . . . . . . . . . . . 55 6.3 Automatic results from language models directly generating multiple visual setups. 56 7.1 Breakdown of our BID-Gen dataset containing research papers with broader impact statements into train, validation, and test splits. . . . . . . . . . . . . . . . 63 7.2 Seven pre-defined questions used to curate the BID-QA dataset for Task 2. . . . . 64 7.3 Detailed statistics of the BID-QA data including the existing different sets in it along with BID-QA-extract set’s train, test, validation split statistics. . . . . . . . . 65 7.4 Results reported in terms of different scores on the task of generating hypothesis or broader impact statements (task 1) for different language models and strategies. 67 vii 7.5 Results on BID-QA-extract question answering test and validation sets according to different metrics over various models. These results show that scores can improve drastically when these models are further fine-tuned on our BID-QA-extract train set compared to only fine-tuning on the SQuAD2.0 dataset. 68 viii ListofFigures 3.1 Negative and positive regard and sentiment results from ConceptNet and GenericsKB. We find outlier target groups with high regard and sentiment percentages that show the severity of intra-targetovergeneralization. We also find large overgeneralizationdisparity for groups in the same category indicated by the span of boxes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Qualitative examples representing targets from the “Profession” and “Religion” categories labeled by the regard measure. Colored regions indicate favoritism, prejudice, both prejudice and favoritism, and somewhat neutral. Higher negative regard percentages indicate prejudice-leaning and higher positive regard percentages indicate favoritism-leaning. We also compare ConceptNet (Speer et al., 2017) and GenericsKB (Bhakthavatsalam et al., 2020) on the “Religion” category by putting both of the results side by side and find similar polarized perceptions for certain target groups, despite a much larger percentage range for GenericsKB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3 Sentiment and regard results showing overgeneralization issues in COMeT and CSG. 13 3.4 Weighted error results obtained from different models that spanned the 139-year history of baby names from the census data on different error types. Female names have higher error rates for all the cases. The y axis represents the calculated error rates for each of the error types, and the x axis represents the year in which the baby name was coming from. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5 Unweighted error results obtained from different models that spanned the 139-year history of baby names from the census data on different error types. Female names have higher error rates for most of the cases. The y axis represents the calculated error rates for each of the error types, and the x axis represents the year in which the baby name was coming from. . . . . . . . . . . . . . . . . . . . 
23 3.6 Error Type-1 Weighted score results obtained from different models on different templates from our benchmark for female and male names collected over 139 years. Context in some of the templates helped some models, but had negative effects on other models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.1 The figure on the left represents before anchoring attack is applied. The right figure represents how the anchoring attack is performed in which poisoned points are located in close vicinity (drawn as the large solid circle) of target points. . . . 28 4.2 Results demonstrating the effect of different attacks on different fairness (SPD and EqOpp) and accuracy (test error) measures over three different datasets (German Credit, COMPAS, and Drug Consumption) with different 𝜖 values. . . . . . . . . . 31 ix 4.3 Results demonstrating the effect of different lambda values in IAF with regards to different fairness (SPD and EqOpp) and accuracy (test error) measures on three different datasets (German Credit, COMPAS, and Drug Consumption) for different 𝜖 values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4 UTSC attack methodology steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5 Attack effectiveness by toxicity classifier. . . . . . . . . . . . . . . . . . . . . . . . 36 4.6 Human evaluation results for various attacks. Black dotted line represents the average score for a given quality ranging from 1 to 3 indicating bad to good quality. Each bar plot demonstrates proportion of workers that rated a particular score for a given quality (red for bad, yellow for moderate, and green for good). For toxicity, we only have two ratings (toxic and not toxic). . . . . . . . . . . . . . 36 4.7 Transferability of UTSC attacks among different toxicity classifiers. The adversary uses Toxic-bert to conduct its attack; however, results transfer to Perspective API and Safety classifier as well. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.8 The two-stage defense framework including interpretable reasonings at levels 1 and 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.9 Effectiveness of different defense mechanisms against different attack strategies using different toxicity classifiers according to Toxic-bert classifier. . . . . . . . . 39 4.10 Human experimental results performed over various defense mechanisms. Black dotted line represents the average score for a given quality ranging from 1 to 3 indicating bad to good quality. Each bar plot demonstrates proportion of workers that rated a particular score (red for bad, yellow for moderate, and green for good). Toxicity ratings are binary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.11 Transferability of our introduced defense mechanism to the Perspective API and Safety classifier for different attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.12 Transferability of our introduced defense mechanism to Perspective API and Safety classifiers for different attack strategies using different toxicity classifiers during the attack process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.13 Number of toxic response generations before and after our introduced defense was applied to GPT-2 from the RealToxicityPrompts dataset (Gehman et al., 2020). The defense is demonstrated to reduce the number of toxic generations in GPT-2. 
Results on Toxic-bert show the defense results, while results on Perspective API and Safety classifier establish the transferability of our defense. . . . . . . . . . . 42 5.1 (a) In general classification model, for each feature 𝑓 𝑘 a vector representation𝑒 𝑘 of length𝑑 𝑒 is learned. These vector representations are passed to the attention layer which produces a𝑑 𝑒 -dimensional vector representation for the sample instance 𝑖. This sample instance representation is finally passed to two dense layers to get the final classification output. (b) The attribution framework has the same architecture as the general classification model. One outcome is obtained through the original model and another through the model that has some attention weights zeroed. The observed differences in accuracy and fairness measures will indicate the effect of the zeroed out features on accuracy and fairness. . . . . . . . 44 5.2 Results from the synthetic datasets analyzing validity of the attribution framework in various scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 x 5.3 Accuracy vs parity curves for UCI Adult and Heritage Health datasets. . . . . . . 49 5.4 Qualitative results from the non-tabular data experiment on the job classification task based on bio texts. Green regions are the top three words used by the model for its prediction based on the attention weights. While the Not Debiased Model mostly focuses on gendered words, our method focuses on profession-based words, such as R.N. (Registered Nurse), to correctly predict “nurse.” . . . . . . . . 51 6.1 Our proposed disambiguation framework. The initial ambiguous prompt is disambiguated by either (1) the language model generating clarifying question which will be resolved through human provided answers, or (2) the language model generating different possible visual setups and human-agent choosing the desired setup. The final disambiguated prompt will later be provided to the downstream text-to-image generative model. . . . . . . . . . . . . . . . . . . . . . 53 6.2 Percentage of generations by GPT-neo that were successful according to human-agent and its comparison to BLEU and ROUGE automatic metrics for the one clarifying question generation setup. . . . . . . . . . . . . . . . . . . . . . . . 56 6.3 Percentage of generations by GPT-neo that were successful according to human-agent and its comparison to BLEU and ROUGE automatic metrics for the multiple visual setup generations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.4 Introduced automatic evaluation framework using a VQA model. Given an ambiguous and a disambiguated prompt, we compare the generations made by the text-to-image model according to a VQA model. As inputs to the VQA model, we provide the human intention in the question format as well as the generated images from each prompt. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.5 Percentages of faithful image generations by DALL-E Mega and OpenAI’s DALL-E according to automatic evaluation using a VQA model. . . . . . . . . . . . . . . . 59 6.6 Percentage of faithful generations by DALL-E Mega and OpenAI’s DALL-E from human evaluations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.7 Fairness qualitative examples from OpenAI’s DALL-E (top row) and DALL-E Mega (bottom row). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 7.1 Results from various models fine-tuned on SQuAD2.0 and our data on different test sets from Task 2. 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 7.2 Results comparing the multitask setup vs separately fine-tuning for each task using the T5 model. Results for generating hypothesis/broader impact statements (Task 1) is demonstrated on the left side. Results for answering ethics related questions using the BID-QA-extract set (Task 2) is demonstrated on the right side. 70 xi Abstract With the advancement of Artificial Intelligence (AI) and its omnipresent role in different appli- cations, it is crucial to ensure that AI systems comply with responsible practices. Moreover, the environment in which AI systems learn and interact with contains various external factors that might adversely affect their behavior. Thus, those systems should be able to mitigate potentially negative impacts of such factors. This dissertation explores several important dimensions that are essential for designing responsible AI systems. First, we focus on fairness as a central concept for responsible AI systems and analyze existing biases in various data sources and models. Moreover, we describe a framework based on interpretability for generating fair and equitable outcomes. Second, we discuss robustness to external perturbations as another important property for such systems. Next, we discuss human-centered AI systems which take natural language prompts as input, demonstrate possible issues due to ambiguous interpretation of those prompts, and describe a framework for resolving such ambiguities and generating faithful outcomes to human intention. Finally, we discuss ideas for designing AI systems that can internalize ethics and form a realization about the consequences of tasks and design choices associated with them. We hope that the contributions presented in this dissertation will move us closer to having more responsible AI systems. xii Chapter1 Introduction Theworldwillnotbedestroyedbythosewhodoevil, butbythosewhowatchthemwithoutdoinganything. –Albert Einstein The incorporation of AI systems into our daily lives is becoming more widespread, with some of systems being deployed and utilized in various sensitive environments and scenarios where they can potentially affect individuals. Thus, the requirement that those systems behave in a socially responsible way has become imperative. This dissertation focuses on the following two overarching questions: 1. What constitutes a responsible AI system? 2. How do we develop responsible AI systems operating in an environment with external factors that might adversely affect the behavior of these systems? The term responsible AI has various connotations and it is hard to have a universal notion of what a responsible AI system should adhere to, as indicated in (Cheng et al., 2021). In this dissertation, we adopt a pragmatic definition of responsible AI systems by focusing on a set of related concepts and components from fair, robust, interpretable, human-centered, and ethical AI literatures. Fair AI. One of the most important practices in designing a responsible AI system is to ensure that the system is fair and avoids any discriminatory or biased behavior in the form of favoritism or prejudice towards an individual or a group based on their inherent or acquired characteristics (Mehrabi et al., 2021a). We are surrounded by biased and discriminatory behavior in the world, which is inevitable to some extent due to historical or social norms that have shaped our societies throughout our existence. 
This biased behavior can easily manifest into our AI systems from the data as well as algorithmic perspectives. Moreover, once such biases are present in AI systems, they can get amplified due to the existence of the feedback loop phenomenon, where biased outcomes from a system are fed back into the data that is used for training models, thus exacerbating the existing bias (Mehrabi et al., 2021a). The difficulty in addressing fairness issues are not limited to the feedback loop phenomenon and bias amplification, but also to the observation that various types of biases can cause different types of undesired behaviors and thus require separate remedies depending on where and how they appear. Indeed, there is no universal 1 measure of bias nor notion of algorithmic fairness, and thus, not a single mitigation strategy. For instance, addressing gender bias in named entity recognition systems might require a different remedy and measuring techniques than gender bias in machine translation systems. Similarly, ranking bias in recommendation systems might need to be addressed differently than sampling bias in data collection pipelines. In this dissertation, we will focus on some of the observed biases in data, algorithms, and models for a subset of possible types of biases in some applications. First, we will discuss existing biases in data resources, such as knowledge bases, their effects on downstream models, and mitigation strategies based on filtering the bias in the data. We will then discuss algorithmic bias and mitigation methods to address exclusion bias in community detection algorithms. Lastly, we will discuss existing biases in models, such as gender bias in named entity recognition systems in the Natural Language Processing (NLP) domain. Our goal is to observe, quantify, and mitigate the negative effects associated with unintentional issues (e.g., bias) in our systems for a more fair AI system design. RobustAI. The existence of historical and unintentional biases in the data and algorithms can create discriminatory and undesired outcomes from our systems and even amplify them in time if appropriate mitigation strategies are not taken into place. Furthermore, sometimes those undesired outcomes might be caused intentionally by an adversary. Thus, it is crucial to ensure that our systems are robust to such intentional phenomena that can trigger discriminatory or toxic behavior. There have been numerous studies on adversarial attacks that target performance measures such as classification accuracy, as well as proposals to defend AI systems against such attacks (Chakraborty et al., 2018). However, lesser attention has been paid to attacks that can potentially target metrics that are relevant to responsible AI, such as fairness. In this dissertation, we address this gap and focus on types of attacks that target such metrics. We believe that these attack dimensions are as important as more typical focus on model performance/accuracy. Furthermore, we aim to improve robustness of AI systems in different scenarios, such as attacks aimed at deteriorating fairness of discriminative ML models, and toxic content generation in dialogue systems. We extensively borrow concepts from adversarial ML to define and explore such attack dimensions. In addition, we use concepts from interpretable ML to propose defense strategies that can improve the robustness of AI systems against such attacks. InterpretableAI. Another important concept in designing responsible AI systems is inter- pretability. 
As AI systems are developed in sensitive environments, such as medical and banking domains, it is important to have interpretable models that can give us more information and understanding of the particular decision made by an AI model. In this dissertation, we are not focusing on the definition for characterization of interpretability, but rather borrowing existing concepts from interpretable AI literature to achieve fair and robust outcomes in a more transparent way. Our main motivation is that if interpretability can be used to explain a particular decision made by a system, it can also help to explain why that particular decision was unfair. We test this idea both on ML and NLP models and show its applicability to a wide range of applications and domains. Our goal is to utilize concepts in responsible AI literature (e.g., interpretability and fairness) to address interrelated concerns in these systems. We believe that while each of 2 these concepts can be individually helpful in designing responsible AI systems, treating them simultaneously can potentially open new horizons and directions for responsible AI research. Human-centered AI. Many AI systems are based on human-in-the-loop concept and/or are designed to interact with humans to ensure that the system output is aligned with human intentions and expectations. For instance, ambiguities in the prompts provided to AI systems by humans might need human supervision for resolving those ambiguities. Human language can contain various types of ambiguities that need resolution through additional signal or feedback provided to AI systems. This signal can be in the form of additional information, commonsense knowledge, visual signal or other forms of modalities. However, in some cases due to the nature of the ambiguity there might not be any suitable resolution other than including human in the loop for the model to seek clarity in the prompts. In this dissertation, we focus on resolving ambiguities in prompts via human feedback. More specifically, we study ambiguous prompts in the context of text-to-image generative models, where the goal is to generate an appropriate image aligned with the human intention. For instance, some prompts can contain syntactic or discourse ambiguities that can cause the model to get confused and not generate the appropriate image that captures human intention. In such cases, the model should be able to seek appropriate additional signals from the human to provide a more satisfactory outcome. In addition to satisfactory outcomes, resolving ambiguities can be beneficial in terms of fairness in the case of under-specified prompts. For instance, for an under-specified ambiguous prompt “Generate an image of a doctor”, it is unclear whether the doctor in the image should be male or female. It is possible that without seeking resolution, the model injects its own biases to generate an image of a doctor that is not aligned with human intention. These types of under-specified prompts can cause negative consequences, such as generating images wih mostly male doctors and/or female nurses. Thus, creating a human-centered AI that is capable of resolving ambiguities and under-specifications gracefully through an appropriate interaction is crucial for a responsible AI system design. EthicalAI. It is intriguing to envision AI systems that can internalize ethics and develop a realization of the potential ethical consequences of the tasks they are designed to solve, design choices, or specific algorithmic solutions. 
The internalization of ethics in AI systems would be a crucial part in creating an ultimate responsible AI system. Although there is a long journey to achieve an ideal version of such system, in this dissertation we explore a small step toward attaining this goal. By internalizing ethics in AI systems, we would not only create systems that can develop a realization about ethical consequences related to different actions, but we can also contribute to the design of responsible AI systems that can detect and intervene in situations where negative effects are flagged in advance. For instance, if a choice of a specific algorithmic solution can have an adverse discriminatory result, this choice can be flagged and an alternative can be proposed. Although there are numerous avenues to explore in this new direction, in this dissertation we target solving two simple problems. First, given a research proposal, we would like to be able to generate a statement about the ethical consequences of the proposal. Second, we want to test the ability of models to answer ethics related questions given the ethical statements written about a research proposal. We believe that solving these two tasks can be an important step toward addressing more challenging and ambitious research questions in this domain. 3 ThesisStatement. The environment in which AI systems learn and interact with contains many factors/agents that can either intentionally or unintentionally manipulate these systems. In order to design a responsible AI system, it is important to be able to: (1) Identify how these manipulations take place and where. (2) Quantify the issues associated with these manipulations. (3) Integrate various concepts from responsible AI literature to address interrelated issues, such as fairness, robustness, and interpretability. (4) Mitigate the effects of these manipulations. This dissertation focuses on designing responsible AI systems considering outside intentional (e.g., targeted attacks against these systems) and unintentional phenomena (e.g., existing societal and historical data biases) that can adversely affect these systems. We collectively study the existing important concepts in responsible AI literature: fairness, robustness, interpretability, human-centered AI, ethics and their intersections to improve our systems. To summarize, in this dissertation, we attempt to address issues related to responsible AI system design and methods to improve these systems. We present our findings in the following order: In Chapter 2, we review necessary background required to understand concepts discussed in the subsequent chapters related to responsible AI practices. In Chapter 3, we introduce fairness as one of the important concepts in responsible AI literature and discuss various types of biases that can exist in various stages of development from data collection, to algorithmic design, to model outcomes and human interactions. We discuss different types of biases, such a gender bias, exclusion bias, and representational harms, along with methods to quantify and mitigate them. In Chapter 4, we discuss intentional adversarial elements that can threaten robustness of AI systems to adversarial attacks that can force the model into making unfair decisions or generating toxic outputs. We also propose methods utilizing interpretability frameworks to combat some of those negative effects. In addition, we discuss how to achieve fair outcomes using interpretability frameworks introduced in Chapter 5. 
In Chapter 6, we discuss how humans and AI systems can interact to resolve ambiguities in the prompts provided to text-to-image generative models, which has various implications for the trustworthiness of those models. In Chapter 7, we discuss initial steps toward designing responsible AI systems that can internalize ethics and form a realization about the consequences of the tasks and design choices associated with them. Chapter 8 discusses related work around concepts in the responsible AI literature. Finally, Chapter 9 contains concluding remarks and a discussion of future directions.

Chapter 2: Background: Overview of Concepts

This chapter briefly reviews the notions and concepts needed to follow the rest of this dissertation, including definitions related to fairness, robustness, and interpretability in the responsible AI literature.

2.1 Bias and Fairness

Research in fairness touches upon various topics, including but not limited to defining fairness notions applicable to different AI applications and models, measuring fairness in various contexts, mitigating bias to achieve fairer outcomes, and introducing benchmark datasets specific to this domain. Here, we discuss some of the initiatives that are the building blocks of the research introduced in this dissertation; this background will be helpful for understanding the concepts discussed in later chapters.

2.1.1 Preliminary Notation

Most work in the fairness domain focuses on reducing discrimination with respect to a sensitive attribute present in the data. Consider a dataset of triples $\mathcal{D} = \{x_i, y_i, a_i\}_{i=1}^{N}$, in which $\mathbf{a} \in \{a_1, \dots, a_l\}$ is a discrete variable with $l$ possible values denoting the sensitive attribute, $\mathbf{y} \in \{0, 1\}$ is the true label, and $\mathbf{x} \in \mathbb{R}^m$ holds the features of the sample. The goal is to achieve fairness with respect to the sensitive attribute $\mathbf{a}$ in the dataset.

In decision-making tasks, there often exist two kinds of groups. The first, denoted $\mathcal{D}_a$, is the historically advantaged demographic group, which has a higher chance of receiving the positive outcome. The second, denoted $\mathcal{D}_d$, is the historically disadvantaged demographic group, which has a lower chance of receiving the positive outcome. For instance, in a college admission scenario, female applicants might represent the disadvantaged group due to historic discrimination that kept them from being admitted to certain colleges and majors, while male applicants might represent the advantaged group, since historically they were more likely to be admitted.

With these notations and definitions in mind, we now define some well-known fairness notions from the AI literature.

2.1.2 Fairness Notions

Since fairness is subjective and highly application dependent, there is no single, universal fairness definition. Here, we list some widely used notions that appear throughout this dissertation.

Statistical Parity. In statistical parity, the goal is to equalize the probability of receiving the positive outcome across demographic groups with different sensitive attribute values.
The mathematical definition of statistical parity is:

$$P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_i) = P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_j),$$

where $\hat{\mathbf{y}} \in \{0, 1\}$ denotes the predicted label. We assume 1 is the positive outcome (e.g., being admitted to a university).

Equality of Opportunity. In equality of opportunity, the goal is to equalize the true positive rates across demographic groups with different sensitive attribute values:

$$P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_i, \mathbf{y} = 1) = P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_j, \mathbf{y} = 1).$$

Equalized Odds. In equalized odds, the goal is to equalize both the true positive and false positive rates across demographic groups with different sensitive attribute values:

$$P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_i, \mathbf{y} = y) = P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_j, \mathbf{y} = y), \quad y \in \{0, 1\}.$$

The list of fairness notions is not limited to the three above; however, these are the notions we use extensively throughout this dissertation.

2.1.3 Fairness Measures

To measure fairness, we often turn the fairness notions above into measures. For instance, to satisfy statistical parity we want to minimize the difference between the probabilities of assigning the positive outcome to different demographic groups; ideally, the statistical parity difference should be 0. We report the Statistical Parity Difference (SPD) as:

$$\mathrm{SPD}(\hat{\mathbf{y}}, \mathbf{a}) = \max_{a_i, a_j} \left| P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_i) - P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_j) \right|.$$

We report the Equality of Opportunity Difference (EqOpp) as:

$$\mathrm{EqOpp}(\hat{\mathbf{y}}, \mathbf{a}, \mathbf{y}) = \max_{a_i, a_j} \left| P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_i, \mathbf{y} = 1) - P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_j, \mathbf{y} = 1) \right|.$$

We report the Equalized Odds Difference (EqOdd) as:

$$\mathrm{EqOdd}(\hat{\mathbf{y}}, \mathbf{a}, \mathbf{y}) = \max_{a_i, a_j} \max_{y \in \{0, 1\}} \left| P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_i, \mathbf{y} = y) - P(\hat{\mathbf{y}} = 1 \mid \mathbf{a} = a_j, \mathbf{y} = y) \right|.$$
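To make these measures concrete, the minimal sketch below (illustrative only, not the evaluation code used in this dissertation) computes SPD, EqOpp, and EqOdd from arrays of predicted labels, true labels, and sensitive attribute values; the function names and the toy data at the bottom are hypothetical.

```python
# Minimal sketch (not the dissertation's evaluation code): computing the SPD,
# EqOpp, and EqOdd measures from predictions, labels, and sensitive attributes.
import numpy as np
from itertools import combinations

def positive_rate(y_hat, mask):
    """P(y_hat = 1) within the subgroup selected by the boolean mask."""
    return y_hat[mask].mean() if mask.any() else np.nan

def group_fairness_measures(y_hat, y, a):
    """Return SPD, EqOpp, and EqOdd as maximum pairwise gaps over groups of `a`."""
    y_hat, y, a = map(np.asarray, (y_hat, y, a))
    spd = eqopp = eqodd = 0.0
    for g_i, g_j in combinations(np.unique(a), 2):
        m_i, m_j = (a == g_i), (a == g_j)
        # Statistical parity: gap in overall positive-prediction rates.
        spd = max(spd, abs(positive_rate(y_hat, m_i) - positive_rate(y_hat, m_j)))
        # Equality of opportunity: gap in true positive rates (y = 1 stratum).
        eqopp = max(eqopp, abs(positive_rate(y_hat, m_i & (y == 1))
                               - positive_rate(y_hat, m_j & (y == 1))))
        # Equalized odds: worst gap over both the y = 1 and y = 0 strata.
        for y_val in (0, 1):
            eqodd = max(eqodd, abs(positive_rate(y_hat, m_i & (y == y_val))
                                   - positive_rate(y_hat, m_j & (y == y_val))))
    return {"SPD": spd, "EqOpp": eqopp, "EqOdd": eqodd}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, size=1000)                      # binary sensitive attribute
    y = rng.integers(0, 2, size=1000)                      # true labels
    y_hat = np.where(a == 1, rng.random(1000) < 0.6,       # deliberately skewed
                     rng.random(1000) < 0.4).astype(int)   # predictions per group
    print(group_fairness_measures(y_hat, y, a))
```

Because each measure is a maximum pairwise gap over groups, a value of 0 on the evaluated data indicates that the corresponding notion is satisfied exactly.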
2.1.4 Widely Used Benchmark Datasets

Certain datasets are well suited for benchmarking fairness work, mostly because they contain sensitive features/attributes that make it possible to evaluate fairness measures. The datasets most widely used in the work introduced in this dissertation are listed below.

UCI Adult Dataset. This dataset contains census information about individuals; the prediction task is whether an individual's income exceeds $50k. One of the sensitive attributes in this dataset is gender (male/female).

German Credit Dataset. This dataset comes from the UCI machine learning repository (Dua and Graff, 2017). It contains credit profiles of individuals, with 20 attributes associated with each data point. The classification goal is to predict whether an individual has good or bad credit.

COMPAS Dataset. ProPublica's COMPAS dataset contains information about defendants from Broward County (Larson et al., 2016). The classification goal is to predict whether an individual will recommit a crime within two years.

Drug Consumption Dataset. This dataset comes from the UCI machine learning repository (Dua and Graff, 2017) and contains information about individuals (Fehrman et al., 2017). The classification goal is to predict whether an individual has consumed cocaine in their lifetime.

Bios Bias Dataset. This dataset contains short bios of individuals; the task is to predict each individual's occupation from their bio. Previous work has shown that certain gendered words in the bios can bias a model toward predicting certain occupations (e.g., doctor is more likely to be predicted for male bios and nurse for female bios). This dataset is widely used in the NLP domain.

2.2 Robustness

To improve the robustness of AI models, researchers have exposed the vulnerabilities of these systems by introducing different attacks, as well as defenses that can overcome the demonstrated vulnerabilities. There are different families of attacks that can degrade the performance of an AI model. One such attack, which is also used in this dissertation, is the data poisoning attack. Data poisoning attacks happen during training: the goal is to corrupt the learned model by adding, removing, or placing poisoned data instances in the training set.

2.2.1 Data Poisoning Attack

Consider a supervised learning problem characterized by a loss function $\mathcal{L}(\mathbf{v}; \mathcal{D})$ and an adversarial loss $L_{adv}(\hat{\mathbf{v}}; \mathcal{D})$, where $\hat{\mathbf{v}}$ is the set of learned parameters and $\mathcal{D}$ is a labeled dataset. Let $\mathcal{D}_{train}$ be the training dataset. We assume that the adversary can poison a fraction of the training points, so that $\mathcal{D}_{train} = \mathcal{D}_c \cup \mathcal{D}_p$, where $\mathcal{D}_c$ and $\mathcal{D}_p$ are the sets of clean and poisoned data points, respectively, with $|\mathcal{D}_p| = \epsilon |\mathcal{D}_c|$. Furthermore, $\mathcal{D}_p \subseteq \mathcal{F}_\beta$, where $\mathcal{F}_\beta$ is the feasible set: the set selected by a defense mechanism based on anomaly detection techniques, containing the elements that the defender considers sanitized data for training its model. Constraining the poisoned points to the feasible set helps them blend in with the natural data and makes it more difficult for anomaly detection techniques to flag them (Koh et al., 2018). A data poisoning attack can be written as the following optimization problem over the set of poisoned data points:

$$\max_{\mathcal{D}_p} \; L_{adv}(\hat{\mathbf{v}}; \mathcal{D}_{test}) \quad \text{s.t.} \quad |\mathcal{D}_p| = \epsilon |\mathcal{D}_c|, \;\; \mathcal{D}_p \subseteq \mathcal{F}_\beta,$$

where $\hat{\mathbf{v}} = \arg\min_{\mathbf{v}} \mathcal{L}(\mathbf{v}; \mathcal{D}_c \cup \mathcal{D}_p)$. In essence, the adversary attempts to maximize its adversarial loss $L_{adv}$ on test data by carefully selecting poisoned data points. These attacks have been shown to be powerful against defenders that minimize their own loss $\mathcal{L}$ while the attacker tries to harm the defense (Koh et al., 2018).
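To illustrate the bilevel structure of this formulation, the sketch below implements a toy, brute-force greedy variant of a poisoning attack against a logistic-regression learner: candidate points drawn from a small pool standing in for the feasible set $\mathcal{F}_\beta$ are added one at a time if retraining on them increases an adversarial loss on held-out data. This is only an illustration of the optimization problem above, not the gradient-based attacks of Koh et al. (2018) and not the fairness-targeted attacks studied in Chapter 4; the function names and candidate pool are hypothetical.

```python
# Toy sketch of the poisoning problem above (not the dissertation's attack): a
# brute-force greedy search over a small candidate pool standing in for the
# feasible set F_beta, against a logistic-regression "defender".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def retrain(X_clean, y_clean, X_poison, y_poison):
    """Defender's inner problem: fit on clean plus currently chosen poisoned points."""
    X = np.vstack([X_clean, X_poison]) if len(X_poison) else X_clean
    y = np.concatenate([y_clean, y_poison]) if len(y_poison) else y_clean
    return LogisticRegression(max_iter=1000).fit(X, y)

def greedy_poison(X_clean, y_clean, X_test, y_test, feasible_X, feasible_y, eps=0.1):
    """Adversary's outer problem: pick up to eps*|D_c| feasible points that most
    increase the adversarial loss (here, plain log loss on held-out data).
    `feasible_X` / `feasible_y` are Python lists of candidate points and labels."""
    budget = int(eps * len(X_clean))
    chosen_X, chosen_y = [], []
    for _ in range(budget):
        model = retrain(X_clean, y_clean, np.array(chosen_X), np.array(chosen_y))
        base_loss = log_loss(y_test, model.predict_proba(X_test))
        best_gain, best_idx = 0.0, None
        for i in range(len(feasible_X)):          # brute force over candidates
            cand_X = np.array(chosen_X + [feasible_X[i]])
            cand_y = np.array(chosen_y + [feasible_y[i]])
            cand_model = retrain(X_clean, y_clean, cand_X, cand_y)
            gain = log_loss(y_test, cand_model.predict_proba(X_test)) - base_loss
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:
            break                                  # no candidate helps the adversary
        chosen_X.append(feasible_X.pop(best_idx))  # commit the best poison point
        chosen_y.append(feasible_y.pop(best_idx))
    return np.array(chosen_X), np.array(chosen_y)
```

Chapter 4 considers poisoning attacks of this general form whose adversarial loss targets fairness measures rather than plain accuracy.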
Chapter 3: Identifying and Mitigating Bias to Improve Fairness

One responsible AI practice is to ensure fair and non-discriminatory behavior of an AI system by identifying and mitigating bias, which can arise throughout the entire development process: from historical or societal biases already present in the data, to algorithmic design choices, to model release and user interaction. We discuss examples of biases in three different components, methods to identify them, and, in some cases, bias mitigation strategies in different domains. We start with data bias and the representational harms present in some knowledge bases crowd-sourced from humans, observe the negative effects that these data biases can have on downstream models, and propose solutions to mitigate the issue for this particular use case. We then turn to algorithmic bias, where certain design choices can lead to exclusion bias, and introduce new algorithmic proposals that mitigate these harmful effects. Lastly, we discuss gender bias observed in some models and the potential risks and sources of these types of bias.

3.1 Bias in Data

Since AI systems are data driven and require data to train on, data is one of the most important elements of AI systems and models. If the training data used for an AI system contains biases, a model trained on that data will learn these biases and reflect them in its predictions, which is undesirable and at odds with responsible AI practices (Mehrabi et al., 2021a). One important source of data is commonsense knowledge bases, which are widely used in commonsense reasoning tasks and models to infuse knowledge into AI systems. However, these data sources are often crowd-sourced from humans, and it is likely that humans inject their own biases and harmful stereotypical views into them. These harmful biases not only remain in the data but also migrate to the models and downstream tasks that use these sources as training data; we observe not only the negative consequences of this migration but also its amplification through the cycle. To abide by responsible AI practices, we should propose methods that filter these harms and biases out of the data and counter their negative effects on downstream models and tasks, which is discussed in detail in this section.

3.1.1 Bias in Commonsense Knowledge Bases

Commonsense knowledge bases (CSKBs) are important data resources that are extensively used in various NLP tasks to incorporate information about everyday situations necessary for human language understanding (Mehrabi et al., 2021c). Numerous commonsense reasoning models, including those for tasks such as question answering (Lin et al., 2019) and dialogue response generation (Zhou et al., 2018, 2021), have incorporated knowledge bases such as ConceptNet (Speer et al., 2017) to train models with external sources of knowledge infused in them. Although these sources of knowledge are valuable resources for training various models, they can contain harmful biases and stereotypes. This is mostly because these knowledge resources are largely human-generated, either crowdsourced from the public (Speer et al., 2017; Sap et al., 2019) or crawled from massive web corpora (Bhakthavatsalam et al., 2020), and can therefore contain harmful stereotypical biases injected by human values and viewpoints.

For our analysis, we focus on two widely known and used knowledge resources, ConceptNet (Speer et al., 2017) and GenericsKB (Bhakthavatsalam et al., 2020). In a preliminary exploration, we observed harmful stereotypical instances in each of these knowledge bases. For instance, from ConceptNet, which represents knowledge in the form of triples, we discovered harmful stereotypical examples such as “(lawyer, RelatedTo, dishonest)” and “(church, UsedFor, brain washing)”. Similarly, from GenericsKB, which contains knowledge in the form of sentences, we discovered harmful examples such as “Chinese people are very reclusive group of people” and “Lawyers are registered menace to society” (Mehrabi et al., 2021c).

After identifying the potential risk of harmful stereotypical and biased views in these knowledge bases, the immediate question became: given that harmful stereotypical and biased views are injected by the outside world into our knowledge bases, how do we quantify such biases and show how widespread this issue is across various CSKBs? To answer this question, we propose concepts that serve as proxies for measuring existing biases in CSKBs.
We define bias as non-neutral views towards the entities studied in the knowledge base, which can take the form of either prejudice (expression of negative views) or favoritism (expression of positive views), and use two classifiers from the NLP literature as proxies to capture such biases. Specifically, we use sentiment (Hutto and Gilbert, 2014) and regard (Sheng et al., 2019) classifiers as approximations for measuring the polarization of a sentence. A positive label assigned by the sentiment classifier serves as a proxy for favoritism or positive polarization, and a negative label as an indication of prejudice or negative polarization. Similarly, a positive label assigned by the regard classifier serves as a proxy for favoritism and a negative label for prejudice. Ideally, we want neutral labels to be associated with all statements in the knowledge base, reflecting a non-polarized view towards the different target groups and entities it covers.

In addition, we introduce two concepts to formalize and quantify existing harms using the proxies (sentiment and regard classifiers) and the definition stated above. (i) Intra-target Overgeneralization: for a particular target group or entity, we report the average percentage of non-neutral statements (positive or negative labels from the sentiment and regard classifiers, marking favoritism or prejudice) over all statements that collectively represent that target group in the knowledge base. Note that this measure targets one particular group or entity (e.g., “lawyer”) and reports a score capturing overgeneralization within that group over all statements that represent it. (ii) Inter-target Disparity: we introduce two separate concepts to capture disparity from different angles: 1) representation disparity, which considers differences between target groups in how frequently they appear in the knowledge base, and 2) overgeneralization disparity, which reports the disparity in the per-group scores obtained from the intra-target overgeneralization measure. To report disparity for both concepts, we use variance. Unlike the intra-target measure, which considers biases and scores within a single group, inter-target disparity considers scores across the various groups under study and reports the disparities among them.

We study 321 target groups from four different categories: origin, gender, religion, and profession. We collected and studied more than 100k triples from ConceptNet and 30k statements containing the target groups from the GenericsKB knowledge base, and report both qualitative and quantitative results in Figures 3.1 and 3.2. Figure 3.1 presents box plots of negative and positive regard and sentiment percentages for targets in our four studied categories for both CSKBs. The presence of outliers in these plots demonstrates the existence of intra-target overgeneralization harm associated with particular groups. In these outlier cases, the positive sentiment percentages can reach 30% in ConceptNet and 80% in GenericsKB, and the negative regard percentages can reach 17% in ConceptNet and 100% in GenericsKB.
From observing the span of the boxes and comparing them across ConceptNet and GenericsKB, we deduce that the scale of biased percentages is much higher in GenericsKB than in ConceptNet. The span of the boxes also reflects the overgeneralization disparity and inter-target overgeneralization disparity across the studied target groups.

Figure 3.2 qualitatively demonstrates both intra-target overgeneralization and inter-target overgeneralization disparity in different categories by plotting the negative and positive regard percentages for each target along the x and y coordinates. For instance, in the “Profession” category, some target professions such as “CEO” have a higher positive regard percentage (blue region) and thus higher intra-target overgeneralization in terms of favoritism. In contrast, some professions, such as “politician”, have a higher negative regard percentage (red region), representing higher intra-target overgeneralization in terms of prejudice. In addition, some professions, such as “psychologist”, have both high negative and high positive regard percentages (purple region), and thus high intra-target overgeneralization in both directions. Moreover, the spread of the different target groups and their placement across the plots demonstrate the existence of inter-target overgeneralization disparity.

Figure 3.1: Negative and positive regard and sentiment results from ConceptNet and GenericsKB. We find outlier target groups with high regard and sentiment percentages that show the severity of intra-target overgeneralization. We also find large overgeneralization disparity for groups in the same category, indicated by the span of the boxes.

Figure 3.2: Qualitative examples representing targets from the “Profession” and “Religion” categories labeled by the regard measure. Colored regions indicate favoritism, prejudice, both prejudice and favoritism, and somewhat neutral. Higher negative regard percentages indicate prejudice-leaning and higher positive regard percentages indicate favoritism-leaning. We also compare ConceptNet (Speer et al., 2017) and GenericsKB (Bhakthavatsalam et al., 2020) on the “Religion” category by putting both results side by side, and find similarly polarized perceptions for certain target groups, despite a much larger percentage range for GenericsKB.

3.1.2 Effects on Downstream Tasks

There are various models that are trained on data coming from knowledge bases (e.g., ConceptNet) for different downstream tasks and applications. If the data coming from these resources contains biases and harms, the models trained on that data are expected to reflect similar biases and harms in their outcomes; the model simply learns from the data and mimics its behavior, much as children learn from their parents.
To demonstrate the effects of a biased data on various downstream tasks, we picked 12 Origin Origin Origin Origin Figure 3.3: Sentiment and regard results showing overgeneralization issues in COMeT and CSG. ConceptNet, which we previously discussed various bias issues associated with it, and studied its effects on two downstream models that use ConceptNet to be trained over. As our first downstream task, we consider the task of commonsense knowledge base completion on the COMeT model (Bosselut et al., 2019), built by fine-tuning a pre-trained GPT model (Radford et al., 2018) over ConceptNet triples. For the second downstream task, we consider Commonsense Story Generation (CSG) (Guan et al., 2020) which generates 3 to 5 sentences to tell a story about a provided prompt. Similar to COMeT, the CSG model is fine-tuned over external commonsense knowledge constructed from ConceptNet and ATOMIC (Sap et al., 2019). To prepare the data for the analysis of the first task, commonsense knowledge base completion, we collected statements in COMeT by inputting the same target words used as prompts in analyzing ConceptNet and collected triples by following all the existing relations in the model. We generated 112k statements and performed a similar analysis with the new generated results as we did for ConceptNet. For the second task, CSG, we prompted the CSG model using templated sentences about the social perception of a certain target. We split our targets into: people,locations,professions, and others classes and manually created 30 templates and populated them with the target words from an appropriate class. We then generated prompts using the populated templates, resulting in approximately 3k prompts for CSG. We ended up with a total of 12k sentences and we performed the analysis based on all the sentences in a given story. From the results presented in Figure 3.3, we observe the effect of the overgeneralization issue on statements generated by COMeT as well as CSG. In addition, sentiment scores as high as 50% for some of the targets in some categories from COMeT and 60% for CSG represent the severity of overgeneralization bias. We also observe the effect of the disparity in overgeneralization in terms of inter-target overgeneralization disparity by the existing variances in both regard and sentiment measures for both tasks. These results demonstrate that some targets are extremely favored or disfavored according to regard and sentiment percentages over other targets, and that this disparity is still apparent amongst targets. 3.1.3 MitigationStrategy So far we have observed and identified the types of existing harms and biases in some data sources. We have quantified such biases and analyzed their effects on downstream tasks and models that utilize these data resources as training data. The one important remaining question is the following: How can we mitigate the effects associated with existing biases in our data resources? One intuitive 13 NSM↑ NSV↓ NRM↑ NRV↓ HNM↑ Quality↑ COMeT 62.6 33.4 78.2 63.2 55.8 55.8 COMeT-Filtered 63.2 32.3 78.6 56.7 60.5 49.9 Table 3.1: Mitigation results from the pre-processing filtering technique (COMeT-Filtered) com- pared to standard COMeT. COMeT-Filtered is effective at reducing overgeneralization and disparity according to sentiment and regard measures along with human evaluation results. Despite the effectiveness of the filtering approach in reducing bias, the quality of generated triples are reduced compared to COMeT. 
One intuitive solution to this problem is to filter out bad and polluted data points from the data resources. Since we have measures that quantify and identify problematic data instances, we can use them to flag such instances and remove them before they are used to train our models. Thus, to mitigate the observed representational harms and biases in ConceptNet and their effects on downstream tasks, we propose a pre-processing data filtering technique that reduces the effect of the existing representational harms in ConceptNet. In our bias mitigation approach, the triples from ConceptNet are first passed through regard and sentiment classifiers, and they are only included in the training process of the downstream tasks if they do not contain representational harms and biases according to our regard and sentiment measures. In other words, all the biased triples that receive positive or negative labels from the regard and sentiment classifiers are filtered out, and only triples with neutral labels are used.

We apply our mitigation technique to COMeT as a downstream model as a case study and report the results obtained by filtering ConceptNet and retraining COMeT on the filtered data. To measure the effectiveness of the mitigation technique on overgeneralization, we consider the increase in the overall mean of neutral triples, which indicates a reduction in overall favoritism and prejudice according to the sentiment and regard measures. We report the effect on overgeneralization for sentiment as Neutral Sentiment Mean (NSM) and for regard as Neutral Regard Mean (NRM). As presented in Table 3.1, our filtered model is able to reduce the unwanted biases in terms of prejudice and favoritism and reduce the overgeneralization issue by increasing the overall neutral sentiment and regard means.

To measure the effectiveness of our proposed mitigation strategy on disparity in overgeneralization, we consider the reduction in variance amongst different targets. We report the disparity in overgeneralization for sentiment as Neutral Sentiment Variance (NSV) and for regard as Neutral Regard Variance (NRV). As presented in Table 3.1, our filtered model (COMeT-Filtered) reduces the variance and disparities amongst targets compared to the standard COMeT model with no filtering, in terms of the regard and sentiment measures.

In addition to reporting automatic measures in terms of regard and sentiment scores, we perform a human evaluation on 3,000 generated triples from the COMeT and COMeT-Filtered models. The goal of the human evaluation is to assess the quality of the generated triples as well as their bias from the human perspective on Amazon Mechanical Turk. From the results presented in Table 3.1, we observe that COMeT-Filtered is judged to have less overall overgeneralization harm according to humans, reported as Human Neutral Mean (HNM). However, the reduction of bias compromised the quality of the generated triples, as COMeT-Filtered is rated to have lower quality compared to standard COMeT in terms of the validity of its triples.
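As a rough illustration of the pre-processing filter described above, the sketch below keeps only triples that every classifier labels as neutral. The verbalizer, the classifier interface, and the toy lexicon-based classifier are simplified, hypothetical stand-ins; in practice the regard and sentiment classifiers from our analysis would be plugged in.

```python
def verbalize(triple):
    """Turn a (head, relation, tail) triple into a simple statement (illustrative)."""
    head, relation, tail = triple
    return f"{head} {relation} {tail}"

def filter_triples(triples, classifiers):
    """Keep only triples that every classifier labels as 'neutral'.

    `classifiers` are callables mapping a sentence to one of
    {'negative', 'neutral', 'positive'}, standing in for the regard and
    sentiment classifiers used in the filtering step.
    """
    kept = []
    for triple in triples:
        sentence = verbalize(triple)
        if all(clf(sentence) == "neutral" for clf in classifiers):
            kept.append(triple)
    return kept

# Toy usage with a dummy lexicon-based "classifier" for illustration only.
def dummy_sentiment(sentence, negative={"lazy", "evil"}, positive={"great", "smart"}):
    words = set(sentence.lower().split())
    if words & negative:
        return "negative"
    if words & positive:
        return "positive"
    return "neutral"

triples = [("engineer", "is", "smart"), ("engineer", "has", "a degree")]
print(filter_triples(triples, [dummy_sentiment]))  # keeps only the neutral triple
```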
3.2 Bias in Algorithms

Although biases in data can appear and have negative consequences for downstream tasks and models, the way we design our algorithms can also contribute to bias in the systems we build. Thus, we are not limited to biases in data: the algorithmic choices we make can also contribute to bias, known as algorithmic bias. Similar to data bias, algorithmic bias can enter the feedback loop and play a role in bias amplification and circulation, resulting in adverse effects on human lives and on society. One such adverse effect is exclusion bias. Exclusion bias happens when certain groups are unfairly excluded from a group they deserve to belong to, or when their voices are excluded from being heard. We will discuss how our algorithmic choices can result in exclusion bias, and we will propose new algorithmic solutions to mitigate it. More specifically, we will consider community detection algorithms from the network science domain that are applied to social networks to identify the communities in a given network. We will demonstrate how certain groups of people, due to the topology of the network they are in, can fall victim to exclusion bias as a result of biased algorithms.

3.2.1 Exclusion Bias in Community Detection Algorithms

Humans are social creatures who survive and live through communication and interaction. With the evolution of technology, human communication also takes different shapes and forms. For instance, these days we tend to interact, share and gather information, and communicate more on social media than in person, as we used to in the past. These new communication channels also change the ways our voices are heard and spread across groups of people, as well as our group associations and friendship circles. Through our social interactions on various social media platforms we form social networks, which in turn affect the ways we see and spread information and form group memberships. Although our social interactions on different social networks can have numerous benefits, they can also have negative consequences for some groups of people if they are excluded from being heard or are treated unfairly. One such unfair treatment occurs when people's opinions are not counted towards a cause, or their membership in a group is not recognized, due to the network structure or their friendship circles. If we naively design algorithms that heavily focus on network structure without considering additional information, we may put some groups of people at a disadvantage and exclude them from being heard, resulting in exclusion bias.

Gamergate dataset:
  CESNA: 597 discarded unique hashtags (38.6%). Examples: #gamerignorance, #ethicsinjournalism, #feministgate, #sexismiswickedcool
  Louvain: 625 discarded unique hashtags (40.4%). Examples: #misandricfeminists, #violenceagainstwomen, #gamersagainstgamergate, #stopgamejournalism2014
U.S. Election dataset:
  CESNA: 1518 discarded unique hashtags (66.9%). Examples: #refugeeswelcome, #Guns, #OscarsStillSoWhite, #TrumpDumped
  Louvain: 302 discarded unique hashtags (13.3%). Examples: #Christians4Hillary, #Killary, #FakeHateCrimes, #TrumpRiots

Table 3.2: Counts, along with examples, of the information that is missed by two traditional community detection algorithms, demonstrating the effect of exclusion bias.

With the advancement of data and network sciences, along with the increasing popularity and usage of social media platforms, researchers started to study different social phenomena by performing analyses over social networks created on various social media platforms. Naturally, researchers developed algorithms and methods to perform such studies. One of the primary problems that researchers look into when studying social networks is community detection.
In community detection, the goal is to detect and understand the communities within the social structures provided by the network. Although community detection is considered an important task in social network analysis, many community detection approaches that rely on the network structure suffer from a major flaw: their inability to assign lowly connected users to communities, which results in exclusion bias (Mehrabi et al., 2019). Lowly connected nodes or users are users who are not well connected to the network for various reasons, one of which can be their small friendship circles. These nodes do not form many edges connecting them to other nodes in the graph. While lowly connected users are by definition not well integrated into the social network, the information they provide can be crucial for a better understanding of the motivations and beliefs of the community. This issue becomes even more important considering that there can be a long tail of such lowly connected users in the network. Failure to incorporate lowly connected users may result in biased outcomes and unfair treatment of these users. This is what we define as exclusion bias in community detection algorithms. For instance, studying communities within a social network can be biased towards users whose tweets receive a large number of retweets. This leads to a biased analysis of the data in which only the voices of more popular users who get more retweets are heard, and not those of less popular users.

We demonstrate the existence of exclusion bias as a form of algorithmic bias in existing community detection algorithms. More specifically, we identify information omitted by existing community detection approaches that excludes the voices of lowly connected users. With this goal in mind, we study two state-of-the-art community detection algorithms, CESNA (Yang et al., 2013) and Louvain (Blondel et al., 2008). Louvain uses only the network structure when assigning communities, while CESNA uses both the network structure and user attributes. We then apply these algorithms to two separate datasets: (i) Gamergate and (ii) the 2016 U.S. Presidential Election. The Gamergate dataset consists of tweets posted in 2014 about the Gamergate controversy (Mortensen, 2018). The 2016 U.S. Presidential Election dataset contains users who discuss the U.S. presidential election of 2016. Both datasets contain two major communities representing two opposing groups in the discussions.

Table 3.2 shows the information that is omitted by excluding the lowly connected users and assigning them to insignificant communities. We observe that by assigning these lowly connected nodes (which often raise the voices of disadvantaged and less represented groups) to insignificant communities, we lose important hashtags, such as “#refugeeswelcome” and “#Christians4Hillary”, that are otherwise not included in the major detected communities, which are the ones most likely to be considered in subsequent analyses. More examples can be found in Table 3.2. We also report the number of such unique discarded hashtags; as reported, these numbers can reach 66.9% and 40.4% for some algorithms on some datasets.

3.2.2 Mitigation Strategy

We have demonstrated that ignoring lowly connected users introduces exclusion bias into the resulting analysis.
Placement of lowly connected users into insignificant groups may create the illusion that these small communities are not particularly relevant to the major groups that need to be studied and can therefore be ignored. We demonstrated that lowly connected users have relevant and informative content, making it crucial to keep them in the significant communities. In light of this, we propose a community detection algorithm that addresses exclusion bias while maintaining higher predictive accuracy by assigning more relevant users to their corresponding communities according to the ground truth labels. Our community detection algorithm, CLAN (Communities from Lowly connected Attributed Nodes), uses node attributes, such as text, alongside the network structure to classify the insignificantly labeled users into the significant communities they belong to. As presented in Algorithm 1, CLAN is a two-step process in which we first use unsupervised learning to detect communities using the network structure and modularity. Once we identify the communities, we turn the problem into a supervised classification problem, in which we classify the insignificantly labeled users from insignificant communities into the major communities using additional node attributes that were held out in the first step. The features used for supervised classification can be any held-out node features, such as text or hashtags used in tweets.

We compare CLAN with two state-of-the-art methods, namely CESNA and Louvain, as baselines in our experiments. As mentioned previously, while Louvain only uses the network structure, CESNA (similar to our approach CLAN) uses both the network structure and node attributes. Our experimental results show that CLAN has superior performance and is in higher agreement with the ground truth communities, while at the same time addressing exclusion bias by incorporating more relevant users into their corresponding groups.

Algorithm 1: CLAN
Input: Network, Threshold
Output: Communities
// Step 1: Find the communities
C = find_communities(Network)
// Step 2: Classify minority users into significant communities
for C_i in C do
    if count(C_i) > Threshold then
        // These are the significant communities
        Add C_i to the training set
    else
        // These are the insignificant communities
        Add C_i to the test set
    end
end
model = train(training set)
predictions = model.inference(test set)
for p in predictions do
    if p.label == C_i then
        Add p.data to C_i
    end
end
return C
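The following is a minimal sketch of the two-step CLAN procedure under simplifying assumptions: it uses the Louvain implementation from a recent networkx (2.8 or later) for step 1 and a TF-IDF plus logistic-regression classifier over held-out node text for step 2. The node attribute name ("node_text"), the threshold value, and the choice of classifier are illustrative and not prescribed by the original method.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clan(graph: nx.Graph, threshold: int = 10):
    """Two-step CLAN sketch: detect communities with the network structure,
    then reassign members of insignificant (small) communities using
    held-out node attributes such as tweet text."""
    # Step 1: unsupervised, structure-only (modularity-based) community detection.
    communities = louvain_communities(graph, seed=0)

    significant = [set(c) for c in communities if len(c) > threshold]
    insignificant_nodes = [n for c in communities if len(c) <= threshold for n in c]

    # Step 2: supervised classification of insignificant users into the
    # significant communities, using the held-out node attributes.
    train_texts = [graph.nodes[n]["node_text"] for c in significant for n in c]
    train_labels = [i for i, c in enumerate(significant) for _ in c]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)

    for node in insignificant_nodes:
        label = model.predict([graph.nodes[node]["node_text"]])[0]
        significant[label].add(node)
    return significant
```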
Table 3.3 reports the F1 and Jaccard similarity scores obtained by comparing the ground truth labels with the labels obtained by applying each of the methods to the two datasets. These scores capture the performance of each community detection algorithm in terms of how accurately the communities are detected with respect to the ground truth community labels. In addition to the F1 and Jaccard scores, we report the fraction of significantly labeled users for each method. This score reflects the number of users that the method has appropriately assigned to their significant communities, preventing them from being ignored in insignificant communities and thereby mitigating exclusion bias. The more significantly labeled users a method has, the less susceptible it is to exclusion bias.

Overall, the results confirm that our method, CLAN, mitigates the exclusion bias towards lowly connected users by incorporating them into the significant communities, while maintaining higher performance in detecting more accurate communities in terms of F1 and Jaccard scores. The results reported in Table 3.3 confirm that not only is our method able to achieve superior predictive accuracy, but it also mitigates bias against insignificantly labeled users by assigning them labels that prevent them from being excluded.

                  Gamergate                                   U.S. Election
Method    Significantly Labeled  F1 Score  Jaccard    Significantly Labeled  F1 Score  Jaccard
CESNA     0.31                   0.343     0.211      0.05                   0.253     0.149
Louvain   0.79                   0.434     0.282      0.80                   0.753     0.604
CLAN      1.00                   0.478     0.318      1.00                   0.787     0.649

Table 3.3: F1 and Jaccard similarity scores computed against the ground truth labels for each of the algorithms, along with the fraction of significantly labeled users, indicating the effect on mitigating exclusion bias.

3.3 Bias in Models

Oftentimes, by analyzing the outputs of certain models, one can see that those outputs contain different types of biases. These biases can take various shapes and forms and can result from different phenomena. Earlier, we discussed the case of representational harms and biases in two downstream models that were trained on biased data resources and, as a result, propagated the bias from the input training data to the model outputs. Here, we will discuss another type of bias in a different kind of model. More specifically, we will discuss gender biases in Named Entity Recognition (NER) systems. In NER systems, the goal is to identify the entities in a given text and tag them appropriately. However, upon analyzing different NER systems, we noticed that female names not only are frequently assigned wrong tags, but in some cases are not even recognized as entities in a given text. We observed that these issues are more prominent for female names than for male names. We will first discuss our observations regarding gender biases in named entity recognition systems. We will then discuss our curated benchmark dataset as well as our proposed metrics for quantifying biases in these models. We will lastly discuss sources of bias and some analysis performed in that regard.

3.3.1 Gender Bias in Named Entity Recognition Systems

To study biases in Named Entity Recognition (NER) models, we analyzed several state-of-the-art NER systems (Mehrabi et al., 2020). More specifically, we studied the difference in the ability of these models to recognize male and female names as PERSON entities. Towards this goal, we evaluated NER models on a dataset covering 139 years of U.S. census baby names and found that relatively more female names than male names were tagged as non-PERSON, or were not recognized as an entity at all, even though the names were used in contexts where they should have been marked as PERSON. For instance, “Charlotte,” a popular and top-ranked female name, is almost always wrongly tagged as a location by state-of-the-art NER systems, despite being used in a context where the entity is clearly a person. In sentences such as “Isabel is sleeping,” Isabel is tagged as miscellaneous, and in “Olivia drinks water,” Olivia is not even recognized as an entity.
To analyze these cases further and at scale, we curated a benchmark dataset along with metrics to quantify our findings regarding gender bias in various NER models. Our benchmark dataset consists of nine templates, i.e., templated sentences that start with a name from the census data followed by a phrase representing a human-like activity (e.g., “<Name> is a person” or “<Name> is going to school”). We report the extent of this bias in several NER systems used prominently in industry and academia. More specifically, we evaluated five named entity recognition models: Flair (Akbik et al., 2018, 2019), CoreNLP version 3.9 (Manning et al., 2014; Finkel et al., 2005), and the Spacy version 2.1 small, medium, and large models. We test these models against 139 years of U.S. census data, from 1880 to 2018, by populating the “<Name>” slot in our benchmark templates with names from the census data, and we evaluate the models based on how well they recognize these names as a PERSON entity. In addition, to quantify the gender biases of these models on our benchmark dataset, we use six measures, defined as follows ($N_f$ is the set of female names in a particular year; the same errors are calculated for male names using $N_m$):

Error Type-1 Unweighted. This error type measures names that are tagged as non-PERSON or not tagged at all:
\[ \frac{\sum_{n \in N_f} I(n_{type} \neq \mathrm{PERSON})}{|N_f|} \]

Error Type-1 Weighted. This error type is similar to Error Type-1 Unweighted; however, it weights each mistagged name by its popularity in the census data:
\[ \frac{\sum_{n \in N_f} freq_f(n_{type} \neq \mathrm{PERSON})}{\sum_{n \in N_f} freq_f(n)} \]
where $freq_f(\cdot)$ is the frequency (popularity) of a name in the female census data for a particular year. Similarly, $freq_m(\cdot)$ is the frequency of a name in the male census data.

Error Type-2 Unweighted. This error type measures only names that are tagged, but whose tags are non-PERSON:
\[ \frac{\sum_{n \in N_f} I(n_{type} \notin \{\emptyset, \mathrm{PERSON}\})}{|N_f|} \]
where $\emptyset$ indicates that the name is not tagged.

Error Type-2 Weighted. This error type is similar to Error Type-2 Unweighted; however, it weights each mistagged name by its popularity in the census data:
\[ \frac{\sum_{n \in N_f} freq_f(n_{type} \notin \{\emptyset, \mathrm{PERSON}\})}{\sum_{n \in N_f} freq_f(n)} \]

Error Type-3 Unweighted. This error type measures only names that are not tagged:
\[ \frac{\sum_{n \in N_f} I(n_{type} = \emptyset)}{|N_f|} \]

Error Type-3 Weighted. This error type is similar to Error Type-3 Unweighted; however, it weights each missed name by its popularity in the census data:
\[ \frac{\sum_{n \in N_f} freq_f(n_{type} = \emptyset)}{\sum_{n \in N_f} freq_f(n)} \]
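A small sketch of how the six error types can be computed from model predictions is given below; `tags` maps each name to the entity type predicted for it (None meaning the name was not tagged), and `freqs` holds the census frequency of each name for the year under consideration. Both inputs, and the example values, are hypothetical stand-ins for the benchmark outputs.

```python
def error_rates(tags, freqs):
    """Compute the six error types for one gender and one census year.

    tags:  dict mapping name -> predicted entity type (None if not tagged)
    freqs: dict mapping name -> census frequency of that name
    """
    names = list(tags)
    n = len(names)
    total_freq = sum(freqs[name] for name in names)

    def rate(condition):
        unweighted = sum(condition(tags[name]) for name in names) / n
        weighted = sum(freqs[name] for name in names if condition(tags[name])) / total_freq
        return unweighted, weighted

    type1 = rate(lambda t: t != "PERSON")                    # non-PERSON or untagged
    type2 = rate(lambda t: t is not None and t != "PERSON")  # tagged, but not PERSON
    type3 = rate(lambda t: t is None)                        # not tagged at all
    return {"type1": type1, "type2": type2, "type3": type3}

# Hypothetical example: four female names with their predicted tags and frequencies.
tags = {"Charlotte": "GPE", "Isabel": "MISC", "Olivia": None, "Emma": "PERSON"}
freqs = {"Charlotte": 9000, "Isabel": 1200, "Olivia": 8000, "Emma": 7000}
print(error_rates(tags, freqs))  # each entry is (unweighted, weighted)
```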
We apply these measures on our benchmark dataset to report the gender biases present in the five studied NER models. For all the reported results we use the “<Name> is a person” template from our benchmark. Our results in Figures 3.4 and 3.5 indicate that, in most cases, all the models are more biased toward female names than male names. In addition, the weighted results show more bias toward female names, indicating that the more frequent and popular female names are also susceptible to bias and error in named entity recognition systems.

In addition to reporting results only on the “<Name> is a person” template, we report results for all the templates in our benchmark dataset for Error Type-1 Weighted, which can be considered a super-set of the other error types, in Figure 3.6. These additional results demonstrate that context helps some models more than others by bringing down the error rates when sentences are added to the names as additional context (templates #2 through #9). Other models perform better on template #1 (which has no context), showing that context can in some cases confuse the model. We observe that in contextual models such as Flair, context indeed helps the model achieve lower error rates for templates #2 through #9 compared to template #1. Other models do not necessarily follow this pattern.

3.3.2 Sources of Bias

After identifying and quantifying gender bias in different NER models, we investigated some possible sources of the observed biases. As discussed previously, data plays a critical role in our models; thus, it is likely that biases in the data sources seep into the models. With this goal in mind, we analyzed some of the datasets widely used for training NER models to determine whether they contain biases toward a specific group that could explain the biased behavior observed in the models' outputs. We analyzed the train, test, and development sets of two widely known datasets that were used in training and testing Flair, Spacy, and many other models: CoNLL-2003 (Sang and De Meulder, 2003; https://www.clips.uantwerpen.be/conll2003/ner/) and OntoNotes-5 (Weischedel et al., 2012; https://catalog.ldc.upenn.edu/LDC2013T19).

Figure 3.4: Weighted error results (Error Types 1 through 3) obtained from the different models over the 139-year history of baby names from the census data. Female names have higher error rates in all cases. The y axis represents the calculated error rate for each error type, and the x axis represents the year the baby name comes from.

We reported the percentages of male
vs. female names from the census data that appear in the train, test, and development sets of each of these datasets, and compared them to the percentages of male vs. female names in the census data itself, which is representative of reality, in order to see how well these datasets reflect reality and whether they are biased toward a specific gender group. Our results, presented in Table 3.4, indicate that unlike the census data, which is representative of real-world statistics and in which female names show more versatility (62% unique female names vs. 38% unique male names), the datasets used in training the NER models contain 42% female names vs. 58% male names from the census data.
Figure 3.5: Unweighted error results (Error Types 1 through 3) obtained from the different models over the 139-year history of baby names from the census data. Female names have higher error rates in most cases. The y axis represents the calculated error rate for each error type, and the x axis represents the year the baby name comes from.

Not only do these datasets fail to contain versatile
female names that reflect reality, they also have less variety overall, which can bias the models by not covering enough female names.

3.4 Conclusion

The existence of bias in various stages of the AI development pipeline can contribute to unfairness and other undesired outcomes. Escaping bias is almost inevitable, as we live in a world where different types of biases have already seeped into many aspects of our lives. Thus, the data and the information we produce are not free of bias.

Figure 3.6: Error Type-1 Weighted score results obtained from different models on different templates from our benchmark for female and male names collected over 139 years. Context in some of the templates helped some models, but had negative effects on other models.

Since AI systems heavily rely on the information
that we provide them (the so-called data), the existing bias in the data creeps into the systems and contaminates the overall development pipeline. This is because bias is not static; it moves through the various stages of development. Bias can propagate from the data to the model outcomes and back to the user. The user then feeds the bias back into the AI system, creating a cyclic loop that contributes to bias amplification. Not only the data, but also the way we design our algorithms can contribute to bias. These biased algorithmic design choices can produce unfair outcomes in various shapes and forms; one example is excluding people from being heard. By being more aware of the existence of such biases, we can design better algorithms that mitigate the effects of exclusion bias and include more people so that they receive their fair share. The outputs of our models can also reflect various types of biases, such as gender bias, which need to be correctly identified, quantified, and mitigated for us to have more reliable and responsible AI systems. We demonstrated that fairness is one of the many important concepts in the responsible AI literature and that careful attention needs to be paid to unintentional biases that can affect our data, models, and algorithms.

Dataset               Female Count  Male Count  Female Pct  Male Pct
Census                67,698        41,475      62%         38%
Train: CoNLL 2003     1,810         2,506       42%         58%
Train: OntoNotes 5    2,758         3,832       42%         58%
Dev: CoNLL 2003       962           1,311       42%         58%
Dev: OntoNotes 5      1,159         1,524       43%         57%
Test: CoNLL 2003      879           1,228       42%         58%
Test: OntoNotes 5     828           1,068       44%         56%

Table 3.4: Percentage of female and male names from the census data appearing in the CoNLL 2003 and OntoNotes datasets, with their corresponding counts. Both datasets fail to reflect the variety of female names.

Chapter 4: Analyzing and Improving Robustness to Societal Concerns

Thus far, we have talked about unintentional factors that can lead to biased outcomes, such as societal biases that inevitably seep into our data through crowdsourcing. But what about intentional adversarial forces that maliciously steer outcomes towards unfair decisions, or, more generally, any malicious activity that targets societal factors, such as triggering a model to generate toxic outputs? We are interested in investigating whether there is a way to analyze the robustness of models against such malicious and intentional factors around societal concerns, along with methods to counter them where possible. Although researchers have previously studied this concept extensively from a pure performance perspective, removed from societal concerns, here we are specifically interested in societal factors. Thus, we will first propose adversarial attacks that target the fairness of ML models and investigate the robustness of ML models to such intentional factors. We will then move to NLP systems and analyze the robustness of generative models in the NLP domain to intentional attacks that can trigger toxic content generation. We will lastly propose a defense against such intentional attacks targeting toxicity. The defense mechanism will use interpretability techniques that can help increase the robustness of these systems to both intentional and unintentional phenomena.
4.1 Robustness to Unfairness

It has previously been demonstrated extensively that machine learning models can be susceptible to various types of adversarial attacks that target the performance of these models (Chakraborty et al., 2018). However, previous research in adversarial machine learning has mostly focused on targeting accuracy (Li et al., 2018). We argue that, similar to accuracy, fairness measures can be targeted by malicious adversaries (Mehrabi et al., 2021b). For instance, adversaries may attack government agencies to make them appear unfair in order to depreciate their value and credibility. Adversaries can even profit from attacking models used in various applications (e.g., loan, credit, or school admission decisions) by biasing the decisions to their benefit. Thus, fairness should be considered as an element when assessing the robustness of ML systems.

4.1.1 Attacks on Fairness

To analyze the robustness of ML systems, we propose data poisoning attacks that target fairness in order to study the vulnerabilities of these models. More specifically, we propose two families of data poisoning attacks: anchoring and influence. In the anchoring attack, poisoned points are placed strategically, based on heuristics, to affect fairness without modifying the attacker loss. In contrast, our influence attack on fairness affects both fairness and accuracy via poisoned points injected at training time, through a specific adversarial loss that combines fairness and accuracy losses. To summarize: in the anchoring attack, we place poisoned points strategically to bias the decision boundary based on a heuristic; in the influence attack, we target fairness measures by incorporating a loss function whose maximization degrades fairness. This loss targets fairness by maximizing the covariance between the decision outcome and the sensitive attributes.

Our anchoring attack is detailed in Algorithm 2. For convenience, we denote the negative outcome as -1. In the anchoring attack, the attacker first samples a target point x_target from the clean data, x_target ∈ D_c. Next, the attacker generates a poisoned data point x̃ in the vicinity of x_target, such that this new point is from the same demographic group but has the opposite label: demographic(x_target) = demographic(x̃) and y_target ≠ ỹ. The general idea behind the anchoring attack is to strategically target some points (x_target) and cloud their labels via poisoned points that have opposite labels but come from the same demographic group, which leads to a skewed and biased decision boundary.

Algorithm 2: Anchoring Attack
Input: clean data set D_c = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, poison fraction ε, and vicinity distance τ.
Output: poisoned data set D_p = {(x̃_1, ỹ_1), (x̃_2, ỹ_2), ..., (x̃_εn, ỹ_εn)}.
for t = 1, 2, ... do
    Sample a negative x_target− from D_a and a positive x_target+ from D_d with the random or non-random technique.
    G+: Generate (|D_c^−| ε) positive poisoned points (x̃+, +1) with D_a in the close vicinity of x_target− s.t. ||x̃+ − x_target−||_2 ≤ τ.
    G−: Generate (|D_c^+| ε) negative poisoned points (x̃−, −1) with D_d in the close vicinity of x_target+ s.t. ||x̃− − x_target+||_2 ≤ τ.
    Load D_p from the generated points above: D_p ← G+ ∪ G−.
    Load the feasible set F_β ← B(D_c ∪ D_p).
    for i = 1, 2, ..., εn do
        Set x̃_i ← argmin_{x ∈ F_β} ||x − x̃_i||_2. (Project D_p back to F_β.)
    end
    argmin_θ L(θ; D_c ∪ D_p).
end
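As a rough illustration, the numpy sketch below generates the poisoned points for one iteration of the anchoring attack in its random variant. It omits the feasible-set projection and the retraining step, uses a simple ball-sampling heuristic for "close vicinity", and all array and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def anchoring_poison(X, y, group, eps, tau):
    """Generate poisoned points near randomly sampled targets (random anchoring).

    X: (n, d) features; y: labels in {-1, +1}; group: 0 = advantaged (D_a),
    1 = disadvantaged (D_d); eps: poison fraction; tau: vicinity distance.
    """
    # Negative target from the advantaged group, positive target from the disadvantaged group.
    adv_neg = np.flatnonzero((group == 0) & (y == -1))
    dis_pos = np.flatnonzero((group == 1) & (y == +1))
    x_target_neg = X[rng.choice(adv_neg)]
    x_target_pos = X[rng.choice(dis_pos)]

    n_pos = int(eps * np.sum(y == -1))  # positive poisoned points placed near x_target_neg
    n_neg = int(eps * np.sum(y == +1))  # negative poisoned points placed near x_target_pos

    def near(center, count):
        # Sample points inside the L2 ball of radius tau around the target.
        noise = rng.normal(size=(count, X.shape[1]))
        noise *= (tau * rng.uniform(0, 1, size=(count, 1))
                  / np.linalg.norm(noise, axis=1, keepdims=True))
        return center + noise

    X_p = np.vstack([near(x_target_neg, n_pos), near(x_target_pos, n_neg)])
    y_p = np.concatenate([np.full(n_pos, +1), np.full(n_neg, -1)])
    g_p = np.concatenate([np.zeros(n_pos, dtype=int), np.ones(n_neg, dtype=int)])
    return X_p, y_p, g_p
```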
The right plot in Figure 4.1 depicts the anchoring attack, in which the poisoned points (colored black) are placed close to strategically chosen target points from the same demographic group but with the opposite label in order to bias the predictive outcome. This strategic placement of poisoned points during training skews the decision boundary and, as a result, causes more advantaged points to receive the positive outcome and more disadvantaged points to receive the negative predictive outcome, thereby biasing the model's predictions. In addition, x_target can be sampled in several ways; we introduce two: random and non-random. In the random anchoring attack, x_target is sampled uniformly at random from each demographic group. In the non-random anchoring attack, we choose a popular x_target as our target for each demographic group. Here, we define a popular x_target as a point that is close to many similar instances x_i that are eligible to serve as targets, i.e., demographic(x_i) = demographic(x_target) and y_i = y_target. By picking popular target points, we affect as many points similar to x_target as possible and thereby maximize the bias.

Figure 4.1: The left panel shows the data before the anchoring attack is applied. The right panel shows how the anchoring attack is performed, with poisoned points located in the close vicinity (drawn as the large solid circle) of the target points.

For the influence attack on fairness, we propose a modified version of the influence attack introduced in (Koh et al., 2018; Koh and Liang, 2017) that can affect fairness when demographic information is involved. In the influence attack on fairness, we incorporate a loss function whose maximization harms fairness. We use the loss function proposed in (Zafar et al., 2015b) for fair classification, with a constraint involving the covariance between the sensitive features ($a$) and the signed distance from the feature vectors to the decision boundary ($d_\theta(x)$), formalized as:
\[ \mathrm{Cov}(a, d_\theta(x)) \approx \frac{1}{N}\sum_{i=1}^{N}(a_i - \bar{a})\, d_\theta(x_i). \]
By combining the above constraint with the original classification loss and maximizing the result, the attacker can harm fairness and accuracy at the same time, with a regularization term, $\lambda$, that controls the trade-off between the two. Thus, the loss in the influence attack on fairness consists of two parts, $\ell_{acc}$ and $\ell_{fairness}$, where $\ell_{acc}$ controls for accuracy and $\ell_{fairness}$ controls for the fairness constraint:
\[ L_{adv}(\hat{\theta}; \mathcal{D}_{test}) = \ell_{acc} + \lambda\, \ell_{fairness}, \quad \text{where} \quad \ell_{fairness} = \frac{1}{N}\sum_{i=1}^{N}(a_i - \bar{a})\, d_{\hat{\theta}}(x_i). \]
The influence attack on fairness thus tries to harm the fairness constraint with respect to disparate impact (Zafar et al., 2017). To maximize the introduced loss, we use the influence attack strategy (Koh et al., 2018; Koh and Liang, 2017), with changes that incorporate demographic information, as shown in Algorithm 3, to make it applicable to our fairness use-case. Similar to the convention in (Koh et al., 2018), we sample one positive and one negative instance uniformly at random and make copies of the sampled instances to serve as our poisoned points. Since our attack has to take demographic information into consideration, we sample the positive instance from D_a and the negative instance from D_d.
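A small numpy sketch of the adversarial loss used in the influence attack on fairness for a linear model is shown below. It combines a logistic accuracy loss with the covariance-based fairness term; the choice of logistic loss and the variable names are illustrative assumptions. In the attack, this quantity is evaluated on the test set and maximized with respect to the poisoned points.

```python
import numpy as np

def adv_loss(theta, X, y, a, lam):
    """L_adv = l_acc + lam * l_fairness for a linear decision boundary.

    X: (n, d) features; y: labels in {-1, +1}; a: binary sensitive attribute;
    theta: (d,) model parameters; lam: trade-off between the two terms.
    """
    d_theta = X @ theta                                 # signed distance to the boundary
    l_acc = np.mean(np.log1p(np.exp(-y * d_theta)))     # logistic classification loss
    l_fairness = np.mean((a - a.mean()) * d_theta)      # covariance with the sensitive attribute
    return l_acc + lam * l_fairness
```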
Algorithm 3: Influence Attack on Fairness
Input: clean data set D_c = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, poison fraction ε, and step size η.
Output: poisoned data set D_p = {(x̃_1, ỹ_1), (x̃_2, ỹ_2), ..., (x̃_εn, ỹ_εn)}.
From D_a, randomly sample the positive poisoned instance I+ ← (x̃_1, ỹ_1).
From D_d, randomly sample the negative poisoned instance I− ← (x̃_2, ỹ_2).
Make copies of I+ and I− until there are ε|D_c| poisoned copies C_p.
Load the poisoned data set D_p ← {C_p}.
Load the feasible set by applying the anomaly detector B: F_β ← B(D_c ∪ D_p).
for t = 1, 2, ... do
    θ̂ ← argmin_θ L(θ; D_c ∪ D_p).
    Pre-compute g^T_{θ̂, D_test} H^{-1}_{θ̂} from L_adv; for details refer to (Koh et al., 2018).
    for i = 1, 2 do
        Set x̃_i^0 ← x̃_i − η g^T_{θ̂, D_test} H^{-1}_{θ̂} ∂²ℓ(θ̂; x̃_i, ỹ_i) / (∂θ̂ ∂x̃_i).
        Set x̃_i ← argmin_{x ∈ F_β} ||x − x̃_i^0||_2. (Project D_p back to F_β.)
    end
    Update the copies C_p based on the updates to I+ and I−.
    Update the feasible set F_β ← B(D_c ∪ D_p).
end

4.1.2 Experiments and Results

Lastly, we perform experiments to show the vulnerability of ML models to the proposed attacks in terms of various fairness metrics as well as accuracy, and to compare them with baseline attacks. For datasets, we use the German Credit, COMPAS, and Drug Consumption datasets. For measures, we use Statistical Parity Difference and Equality of Opportunity Difference, along with accuracy. We compare the influence attack on fairness and the random and non-random anchoring attacks against the basic influence attack, which targets accuracy only and not fairness (Koh et al., 2018), and against the attack proposed by Solans et al. (2020), which is targeted toward fairness. The adversarial loss function used to attack fairness in (Solans et al., 2020) is shown below:
\[ L_{adv}(\hat{\theta}; \mathcal{D}_{test}) = \underbrace{\sum_{k=1}^{p} \ell(\hat{\theta}; x_k, y_k)}_{\text{disadvantaged}} + \lambda \underbrace{\sum_{j=1}^{m} \ell(\hat{\theta}; x_j, y_j)}_{\text{advantaged}}, \quad \text{where} \quad \lambda = \frac{p}{m}. \]

The results in Figure 4.2 demonstrate that the influence attack on accuracy (Koh et al., 2018), although performing remarkably well in attacking accuracy, is not effective in attacking fairness. The results also confirm that the influence attack on fairness (IAF) outperforms the fairness attack proposed in (Solans et al., 2020) as well as the anchoring attacks (RAA and NRAA) in affecting fairness measures; it is the most effective among all the attacks at degrading the fairness measures. In addition, in Figure 4.3 we demonstrate the effect of the regularized loss in the influence attack on fairness. The results show that, as lambda increases, the attack affects the fairness measures more, as expected; for lower lambda values the attack behaves similarly to the original influence attack on accuracy. The results also confirm that higher epsilon values (e.g., an epsilon value of 1) have a more significant effect on the fairness measures, while the effect is more subtle for lower epsilon values such as 0.1.

Figure 4.2: Results demonstrating the effect of different attacks on different fairness (SPD and EqOpp) and accuracy (test error) measures over three different datasets (German Credit, COMPAS, and Drug Consumption) with different ε values.

Figure 4.3: Results demonstrating the effect of different lambda values in IAF on different fairness (SPD and EqOpp) and accuracy (test error) measures on three different datasets (German Credit, COMPAS, and Drug Consumption) for different ε values.

4.2 Robustness to Toxicity

Another major societal concern, and a significant research direction in NLP, is the detection of toxic language. It is important to ensure that our systems stay away from any toxic language and behavior. However, as discussed in previous chapters, most of the data that models are trained on is either crowd-sourced or collected from online resources that are likely to contain toxic content. Although there has been significant effort in reducing this kind of unintentional toxic content generation, which is likely to originate from the data resources, across various NLP domains, not much attention has been paid to intentional factors that can trigger models into toxic content generation, or to methods for countering them. In this section, we analyze the robustness of generative models against intentional adversarial factors that can trigger toxic content generation, and methods to overcome them, by proposing attacks, comparing them with baselines, and proposing defense mechanisms. We propose defense mechanisms against such attacks that can mitigate not only the effects of intentional adversarial factors but also unintentional toxic behavior, using techniques from the interpretability literature. Through different types of experiments, we demonstrate the validity of these defense mechanisms in mitigating toxic behavior and improving the robustness of the models.

4.2.1 Attacks on Toxicity

Existing work on adversarial attacks against language generation is relatively narrow. Wallace et al. (2019) proposed attacks based on universal adversarial triggers (UAT) that can result in toxic content generation with a relatively high success rate. However, the triggers produced by the UAT attack are unnatural and incoherent sequences of words that can be easily detected via a language model loss. Furthermore, such attacks cannot succeed against voice-based dialogue systems, where the input to the dialogue model comes from speech recognition and must necessarily conform to human language norms. The goal of the universal adversarial trigger attack (UAT) is to find a universal trigger sequence for a given trained model which, if attached to the start of any given input, can cause the model to output the desired outcome (Wallace et al., 2019). This attack starts with an initial fixed-length sequence, e.g., “the the the the the the”, and iteratively replaces the tokens in the sequence to satisfy an objective. The iterations terminate when no improvement (replacement) can be made to further optimize the objective. When forcing models to generate toxic content via this attack, the objective is to search for triggers that maximize the likelihood of toxic tokens being generated:
\[ f_{\text{UAT}} = \sum_{y \in \mathcal{Y}} \sum_{i=1}^{|y|} \log P(y_i \mid y_{1:i-1}; t, \theta), \]
where $\mathcal{Y}$ is the set of toxic outputs, $t$ denotes the trigger sequence, and $\theta$ is a trained language model. One important limitation of this attack is that, since there is no constraint on the trigger, it does not necessarily satisfy any language modeling loss; thus, the obtained trigger sequence is usually a nonsensical phrase that can be easily detected as a high-perplexity anomaly.

A naive and intuitive workaround to combat this limitation is to impose a language modeling objective on the trigger tokens, so that the objective for the UAT-LM attack becomes:
\[ f_{\text{UAT-LM}} = f_{\text{UAT}} + \sum_{y \in \mathcal{Y}} \sum_{j=1}^{|t|} \log P(t_j \mid t_{1:j-1}; \theta). \]
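To make these two objectives concrete, the sketch below scores a fixed candidate trigger under f_UAT (the log-likelihood of a set of toxic target outputs conditioned on the trigger) and adds the language-model term of f_UAT-LM. It does not implement the iterative token-replacement search itself, it uses GPT-2 from the transformers library purely as an illustrative stand-in for the attacked model, and the toxic target strings are placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def conditional_logprob(prefix_ids, target_ids):
    """Sum over log P(target_j | prefix, target_{<j}) under the language model."""
    input_ids = torch.cat([prefix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits[0], dim=-1)
    offset = prefix_ids.shape[0]
    # logits at position k predict the token at position k + 1
    return sum(log_probs[offset + j - 1, tok].item() for j, tok in enumerate(target_ids))

def uat_objectives(trigger, toxic_outputs):
    trigger_ids = tokenizer.encode(trigger, return_tensors="pt")[0]
    # f_UAT: likelihood of each toxic target output conditioned on the trigger.
    f_uat = sum(conditional_logprob(trigger_ids,
                                    tokenizer.encode(" " + y, return_tensors="pt")[0])
                for y in toxic_outputs)
    # UAT-LM adds the language-model likelihood of the trigger itself,
    # summed once per toxic output as in the objective above.
    bos = torch.tensor([tokenizer.bos_token_id])
    f_uat_lm = f_uat + len(toxic_outputs) * conditional_logprob(bos, trigger_ids)
    return f_uat, f_uat_lm
```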
Note that this objective does not guarantee the generation of sufficiently fluent triggers. Even if the generated triggers are sensible, they will not generally retain the flow of the conversation in terms of coherency and relevancy in conversational systems. In light of this, we propose a different modification to the attack strategy to accommodate these requirements. We introduce an imperceptible adversarial attack called Unigram Trigger with Selection Criteria (UTSC) on dialogue systems that produces natural-looking and coherent utterances as triggers. These attack utterances cannot be easily detected using anomaly detection techniques. As such, UTSC can also target voice-based assistants that rely on speech recognition systems. Our proposed approach works by augmenting the UAT from Wallace et al. (2019) with additional selection criteria to generate imperceptible yet effective triggers (Mehrabi et al., 2022a). To take the history of the conversation h into account and retain the fluency, coherency, and relevancy of the conversation while generating the attack, we generate a collection of unigram triggers from UAT. We then feed these triggers, along with the conversation history h, to our dialogue model and generate different attack utterances. After gathering all the candidate utterances, we pick the best-suited attack utterance among them according to our selection criterion, as illustrated in Figure 4.4. Since we rely on the dialogue model to generate the attack utterance given h, the generated utterance will be relevant to the conversation. Furthermore, since we use a unigram trigger from UAT, the fluency of the utterance is not noticeably sacrificed.

Figure 4.4: UTSC attack methodology steps. (1) The adversary generates n unigram triggers (UTs) using UAT, each starting from a random unigram, e.g., UT_1 = UAT(the), UT_2 = UAT(a), ..., UT_n = UAT(an). (2) The adversary uses the dialogue history h and each of the UTs as context to the DialoGPT model and generates adversarial examples Ex_1 = DialoGPT(h + UT_1), ..., Ex_n = DialoGPT(h + UT_n). (3) The adversary runs one toxicity classifier, or a combination of them (average toxicity score), and assigns a toxicity score to each example, Score_i = Toxicity(Ex_i). (4) Depending on the criterion, the adversary chooses the final attack utterance: UTSC-1: argmax_i {Score_i}; UTSC-2: argmin_i {Score_i | Score_i >= T}; UTSC-3: argmin_i {Score_i}.

We quantify the toxicity of each candidate attack utterance using either a single toxicity classifier or an ensemble of such classifiers. For multiple classifiers we use the average, and for a single classifier the raw output probability score, which we refer to as the toxicity score $x_i$ for candidate utterance $i$. We then select the final attack utterance amongst the $n$ candidate adversarial examples according to one of three selection criteria. Previous work (Xu et al., 2020) has demonstrated that toxic triggers are more likely to provoke toxic responses. Thus, in UTSC-1, we select the most toxic utterance among all generated attack utterances according to the toxicity scores as our final attack utterance (i.e., $\arg\max_{i \in [n]} \{x_i\}$). We introduce two additional criteria. The first is UTSC-2, in which we apply a threshold $T$ to the toxicity scores of the candidate utterances and label the utterances above this threshold as toxic; from the pool of utterances determined to be toxic, we then select the one with the lowest toxicity score (i.e., $\arg\min_{i \in [n]} \{x_i \mid x_i \geq T\}$). If no utterance falls above the threshold, we select the most toxic utterance. The second is UTSC-3, in which we select the utterance with the lowest toxicity score, i.e., $\arg\min_{i \in [n]} \{x_i\}$. Aside from automatic attacks, Xu et al. (2020) use a human-and-model-in-the-loop framework to generate natural-looking attacks against dialogue models, but this approach is costly and inherently not scalable. Unlike other methods, UTSC is fully automated and scalable, affording the efficient exploration of a large number of attack vectors and system vulnerabilities. UTSC is an attack against conversational agents that is imperceptible, i.e., it fits the conversation in terms of coherency, relevancy, and fluency, while being effective and scalable, i.e., it can automatically trigger the system into generating toxic language.
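A short sketch of the three UTSC selection criteria is given below. It operates on candidate attack utterances and their (e.g., ensemble-averaged) toxicity scores; the candidate generation and toxicity scoring themselves are assumed to have been done by the dialogue model and the toxicity classifiers as described above, and the example inputs are made up.

```python
def select_attack(candidates, scores, criterion="UTSC-1", threshold=0.5):
    """Pick the final attack utterance from n candidates given their toxicity scores.

    candidates: list of candidate attack utterances (one per unigram trigger)
    scores:     toxicity score x_i for each candidate
    """
    indexed = list(enumerate(scores))
    if criterion == "UTSC-1":
        # Most toxic candidate.
        best, _ = max(indexed, key=lambda t: t[1])
    elif criterion == "UTSC-2":
        # Least toxic candidate among those above the toxicity threshold;
        # fall back to the most toxic one if none exceeds the threshold.
        above = [t for t in indexed if t[1] >= threshold]
        best, _ = min(above, key=lambda t: t[1]) if above else max(indexed, key=lambda t: t[1])
    elif criterion == "UTSC-3":
        # Least toxic candidate overall.
        best, _ = min(indexed, key=lambda t: t[1])
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return candidates[best]

# Example with made-up scores for three candidate utterances.
print(select_attack(["utt_a", "utt_b", "utt_c"], [0.9, 0.2, 0.6], criterion="UTSC-2"))
```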
4.2.2 Attack Experiments and Results

In order to compare different attacks from various aspects (e.g., their effectiveness in triggering toxicity, as well as preserving coherency, relevancy, and fluency in the conversation), we perform both automatic and human evaluations of the discussed attacks. Through these evaluations we show the effectiveness of the discussed attacks in provoking the defender into generating toxic responses while keeping the fluency and coherency of the conversation intact.

To perform the experiments, we use DialoGPT (Zhang et al., 2020b) and generate 100 conversations around a specific topic. The topic is determined by the context sentence that starts the conversation between the adversary and the defender. The context sentences come from two different datasets, Wizard of Wikipedia (Dinan et al., 2018) and ConvoKit's Reddit Corpus (https://convokit.cornell.edu); 50 random context sentences were picked from the Wizard of Wikipedia dataset and 50 from the Reddit dataset. Each conversation runs for 10 turns, and the adversary generates the attack utterance on the third turn of each conversation. To determine the toxicity of the candidate attack utterances generated by the adversary in the UTSC attacks, we utilize an ensemble of three different toxicity detection models: Toxic-bert (https://github.com/unitaryai/detoxify), Perspective API (https://www.perspectiveapi.com), and the Safety classifier (Xu et al., 2020).

We report how effective each attack is by calculating the percentage of conversations in which the defender is provoked by the adversary into generating a toxic response. We demonstrate the results comparing the UAT baseline with the UAT-LM and UTSC attacks. Results in Figure 4.5 demonstrate that UAT-LM and UTSC-1 perform best according to the Perspective API and Toxic-bert classifiers, while UAT performs best according to the Safety classifier. The results also confirm previous findings reported by Xu et al. (2020), in which the authors show that more toxic attacks performed by human adversaries are better at forcing the model to generate toxic utterances. In our results, we likewise show that UTSC-1, which relies on the most toxic candidate utterances, is the strongest among the UTSC family of attacks, followed by UAT-LM. UTSC-3, which is based on the least toxic utterances, performs the worst, followed by UTSC-2, which is constrained to the least toxic utterance above the toxicity threshold. Thus, the results confirm that the toxicity of the attack plays a significant role in attack effectiveness.

In addition, the UTSC-1 and UAT-LM attacks have the advantage of being more fluent and more imperceptible. The average perplexity score according to the GPT-2 language model for the attack phrases generated by UAT is absurdly high (~10^7), compared to ~10^4 for UAT-LM and ~160 for UTSC-1; the perplexity of the no-attack case (unaltered DialoGPT conversations) is ~39. These perplexity scores automatically confirm that the UTSC and UAT-LM attacks are more fluent, natural, and imperceptible.
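The perplexity comparison above can be reproduced in spirit with a short GPT-2 scoring routine such as the following (a sketch, not the exact evaluation script used for the reported numbers):

```python
# Score a (trigger or attack) phrase with GPT-2 so that nonsensical UAT
# triggers show up as high-perplexity outliers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels makes the model return the mean token cross-entropy
        loss = lm(ids, labels=ids).loss
    return torch.exp(loss).item()

# e.g., gpt2_perplexity(uat_trigger) is expected to be far larger than
# gpt2_perplexity(utsc_attack_utterance)
```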
This observation is further confirmed by human evaluations. To compare and verify the quality of conversations generated during and after the attacks, we conduct human experiments in which we ask workers on Amazon's Mechanical Turk (AMT) to rate the conversations. We asked AMT workers to rate the generated attack's fluency, the relevancy of the attack utterance to previous utterances, and the overall conversation coherency on a Likert scale of 1 to 3, representing poor, moderate, and good quality, respectively. We also asked AMT workers to rate whether the utterance after the attack is toxic or not, to verify the effectiveness of the attack according to human judgment. AMT workers annotated 100 conversations from each of the three attacks, and each conversation was annotated by 3 AMT workers, giving us 900 annotations overall, 300 from each attack.

Results from our human evaluation studies are demonstrated in Figure 4.6. The UTSC-1 attack is rated as having the highest coherency, relevancy, and fluency, whereas UAT, which generates meaningless phrases, is rated very poorly on all of these qualities. In terms of attack effectiveness and toxicity, the attacks are rated as having comparable performance at around 20% effectiveness, close to the automatic results discussed previously.

Figure 4.5: Attack effectiveness by toxicity classifier.

Figure 4.6: Human evaluation results for various attacks. The black dotted line represents the average score for a given quality, ranging from 1 to 3, indicating bad to good quality. Each bar plot demonstrates the proportion of workers that rated a particular score for a given quality (red for bad, yellow for moderate, and green for good). For toxicity, we only have two ratings (toxic and not toxic).

To ensure that the adversary is not overfitting the toxicity detection model but rather forcing the defender to actually generate toxic language, we also study the transferability of these attacks. We allow the adversary to only use one of the toxicity detection models to design its attack, and we then quantify toxicity using the other two toxicity detection methods, not accessed by the adversary. Results in Figure 4.7 demonstrate that even if the attacker only uses one of the toxicity detection models (Toxic-bert), it can still force the defender to generate toxic responses according to the other classifiers (Perspective API and Safety classifier). These results confirm that the attack forces the defender to generate actual toxic language rather than merely fooling a specific toxicity classifier.

Figure 4.7: Transferability of UTSC attacks among different toxicity classifiers. The adversary uses Toxic-bert to conduct its attack; however, the results transfer to Perspective API and the Safety classifier as well.

4.2.3 Defenses on Toxicity

In order to design a robust AI system, it is not sufficient to only showcase the vulnerability of these systems to various attacks; we must also show how to combat them. In light of this, we introduce defenses that improve robustness against attacks that trigger toxic content generation. The defense against adversarial attacks discussed in this section has two components: (a) detecting the attack and (b) mitigating its effect by ensuring that the defender does not generate a toxic response. The detection problem is rather straightforward, as the defense can simply use a toxicity classifier on the generated response to identify whether the content is toxic or not.
The mitigation, however, is more challenging. Xu et al. (2020) suggested a mitigation approach called the Two-stage Non Sequitur defense: when a toxic response is detected, it simply resets the dialogue and generates a (non-toxic) utterance by randomly sampling from a predefined set of non-toxic topics. This baseline is a two-stage approach in which the defender first uses a toxicity classifier to detect whether the utterance is toxic. If the utterance is detected to be toxic, it then changes the topic of the conversation, e.g., "Hey do you want to talk about something else? How about we talk about X?" where X is a randomly chosen topic from 1087 topics judged as safe from the Wizard of Wikipedia conversational topic list (Dinan et al., 2018). Xu et al. (2020) used this defense against human adversaries that aimed to force the model to generate toxic responses. Notice that although this defense uses a templated sentence to change to a non-toxic topic and can be considered a perfect solution for avoiding toxic responses, it can give the user an implausible conversational experience, because the topic of the conversation changes each time the defender detects a toxic utterance. Although this method can be very effective, it cannot generate coherent and relevant dialogues.

We are interested in mitigation strategies that avoid generating toxic utterances while keeping the conversation flow intact. In light of this, we introduce a defense based on a two-stage mechanism in which the defender first runs a toxicity detection model on its generated utterance. If the generated utterance is toxic, it proceeds with the second stage of the defense. The second stage utilizes two layers of reasoning using two different interpretability techniques. The first layer aims to detect the tokens in the defender's utterance that cause the toxicity detection model to label the utterance as toxic; we call these the L1 tokens. The second layer aims to detect the tokens in the adversary's attack utterance that are responsible for the generation of the L1 tokens in the defender's utterance; we call these the L2 tokens. The defender then masks the L2 tokens from the adversary's utterance, which were responsible for triggering the defender model to generate toxic tokens, and generates a new utterance. The defender then applies a toxicity classifier to this new utterance. If it is deemed safe, it replaces the defender's old toxic utterance; otherwise, the defender iteratively applies the two-stage defense mechanism to mask more input tokens until the generated output is deemed safe. The introduced defense framework is demonstrated in Figure 4.8.

Figure 4.8: The two-stage defense framework including interpretable reasoning at levels 1 and 2: the defender (1) runs a toxicity detection classifier on its generated utterance in the original conversation and finds out that it was tricked by the adversary; (2) at level 1, identifies the toxic token (L1 token) responsible for making the defender's utterance toxic; (3) at level 2, identifies the token in the adversary's utterance (L2 token) responsible for generating the L1 token; and (4) masks the L2 token from the adversary's utterance and generates a new response.

The first layer in the defense mechanism uses transformers-interpret (https://github.com/cdpierse/transformers-interpret), which provides explanations and identifies the L1 token according to the Toxic-bert model. For the second layer, the defense uses LERG (Tuan et al., 2021), which provides local explanations for dialogue response generation and identifies the L2 token.
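A schematic sketch of this defense loop is shown below. The toxicity check and the L1 attribution use publicly available tools (Detoxify and transformers-interpret with Toxic-bert); the L2 step is left as a placeholder since LERG's interface is not reproduced here, and `generate_reply` stands in for the defender's dialogue model.

```python
# Schematic of the iterative two-stage defense loop described above.
from detoxify import Detoxify
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

toxicity = Detoxify("original")
tox_tok = AutoTokenizer.from_pretrained("unitary/toxic-bert")
tox_model = AutoModelForSequenceClassification.from_pretrained("unitary/toxic-bert")
l1_explainer = SequenceClassificationExplainer(tox_model, tox_tok)

def find_l2_token(adversary_utt: str, l1_token: str) -> str:
    # Placeholder for LERG (Tuan et al., 2021): return the token in the
    # adversary's utterance most responsible for generating `l1_token`.
    raise NotImplementedError

def defend(adversary_utt: str, generate_reply, max_iters: int = 3, tau: float = 0.5) -> str:
    masked = adversary_utt
    reply = generate_reply(masked)
    for _ in range(max_iters):
        if toxicity.predict(reply)["toxicity"] < tau:
            return reply                                  # reply is safe: keep it
        # L1: token in the reply driving the toxicity prediction
        l1_token = max(l1_explainer(reply), key=lambda kv: kv[1])[0]
        # L2: token in the adversary's utterance responsible for the L1 token
        l2_token = find_l2_token(masked, l1_token)
        masked = masked.replace(l2_token, tox_tok.mask_token or "[MASK]")
        reply = generate_reply(masked)                    # regenerate from masked input
    return reply
```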
4.2.4 Defense Experiments and Results

We perform automatic and human experiments and report the effectiveness of various defense mechanisms in their ability to reduce toxicity as well as to preserve the conversational flow. We compare the Two-stage Non Sequitur defense proposed in Xu et al. (2020), along with our introduced framework (Mehrabi et al., 2022a) demonstrated in Figure 4.8, to a Trigger Masking (TM) baseline in which we consider masking the adversarial trigger tokens. Note that once the defender gets attacked, it does not generally know which tokens were the trigger tokens used by the adversary, so the Trigger Masking (TM) approach is not applicable in realistic settings. However, we believe that considering this type of oracle baseline can still give us interesting insights, so we include it in our experiments. We report "defense effectiveness" as the percent decrease in toxic response generation after the adversary's attack when the defense is applied, compared to when it is not.

Results from the automatic experiments in Figure 4.9 demonstrate that both our proposed defense mechanism and the Non Sequitur baseline achieve 100% defense effectiveness according to the Toxic-bert classifier. Our introduced defense outperforms the oracle Trigger Masking baseline. This is because using model interpretability can give us more valuable insights than blindly masking the triggers: some tokens generated after the trigger can be more toxic and decisive in forcing the defender into generating toxic utterances than the adversarial trigger itself. This highlights the importance of having an interpretable defense mechanism. As expected, the Non Sequitur defense is always effective, as it replaces the toxic utterance with a non-toxic templated utterance by changing the topic; however, this approach does not necessarily create the best conversational experience in terms of maintaining the relevancy and coherency of the conversation, as also verified by our human experiments.

Figure 4.9: Effectiveness of different defense mechanisms against different attack strategies using different toxicity classifiers, measured according to the Toxic-bert classifier.
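For concreteness, the defense effectiveness reported in Figure 4.9 can be written (in our notation) as the relative reduction in toxic responses:

$$\text{Defense effectiveness} = 100 \times \frac{N_{\text{toxic}}^{\text{no defense}} - N_{\text{toxic}}^{\text{defense}}}{N_{\text{toxic}}^{\text{no defense}}}\,\%,$$

where $N_{\text{toxic}}^{\text{no defense}}$ and $N_{\text{toxic}}^{\text{defense}}$ denote the number of attacked conversations in which the defender's response is labeled toxic by the reference classifier without and with the defense, respectively.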
To perform human evaluations, we asked Amazon Mechanical Turk (AMT) workers to evaluate the defense quality in terms of relevancy and fluency, the coherency of the overall conversation, and the toxicity of the defense utterance. 27 conversations were rated for each of the three defenses (Trigger Masking (TM), Two-stage Non Sequitur, and our introduced defense), and 3 AMT workers rated each conversation, giving us 243 annotations in total, 81 from each defense. Figure 4.10 demonstrates the results from the human experiments. Our introduced defense is rated as having the highest fluency and relevancy scores. While our defense mostly receives moderate to good ratings for relevancy, the Non Sequitur defense has poor relevancy scores, because it changes the topic each time a toxic utterance is generated, which lowers the quality of the conversational experience. Thus, even if the Non Sequitur defense can be effective in reducing toxicity by replacing the toxic utterance with a non-toxic templated sentence, it can create a poor conversational experience, as rated by human annotators.

Figure 4.10: Human experimental results performed over various defense mechanisms. The black dotted line represents the average score for a given quality, ranging from 1 to 3, indicating bad to good quality. Each bar plot demonstrates the proportion of workers that rated a particular score (red for bad, yellow for moderate, and green for good). Toxicity ratings are binary.

Figure 4.11: Transferability of our introduced defense mechanism to the Perspective API and Safety classifier for different attacks.

In addition, we discuss the results on the transferability of our defense mechanism to different toxicity classifiers as well as to human-performed attacks.

1. Transferability of the defense to other toxicity detection classifiers: Results in Figure 4.11 demonstrate that even if the defender utilizes the interpretability results from the Toxic-bert classifier, it can still be effective in reducing toxicity according to Perspective API and the Safety classifier on all attacks.

2. Transferability of the defense when the UTSC attack uses a different toxicity classifier than the one the defender uses in its defense: We also report that even if the defender and the attacker do not use the same toxicity classifiers, the defense can still be effective. Results of the performed defense on all the combinations of toxicity detectors used by the attacker for its selection criteria are demonstrated in Figure 4.12.
3. Transferability of the defense to human-generated attacks: To make sure that the introduced defense can also transfer to human-generated attacks and not just automatic attacks, we manually generated attacks against the DialoGPT model and conversed with it as the adversary. In line with the automatic attacks, we managed to trigger the system in 10% of the cases. We also observed a 70% reduction in toxic generation when one iteration of the introduced defense mechanism was applied to these attacks.

Figure 4.12: Transferability of our introduced defense mechanism to the Perspective API and Safety classifiers for different attack strategies using different toxicity classifiers during the attack process.

While the Non Sequitur baseline defense (Xu et al., 2020) only works for the conversational domain, our introduced method has the advantage of working on any conditional generation task. Thus, in addition to analyzing conversational models as a form of generative model, we extended our analysis to general generative models. We show the generalizability of our defense method to non-conversational generation tasks by conducting experiments with the RealToxicityPrompts dataset (Gehman et al., 2020). It has previously been shown that prompts in RealToxicityPrompts can force different generative models, such as GPT-2 (Radford et al., 2019), to generate toxic responses. Thus, we used our defense to test whether it can also be effective in reducing the number of toxic responses generated by GPT-2 given prompts from RealToxicityPrompts, without the existence of any adversary. These results not only showcase the applicability of our approach to non-conversational models, but also to cases in which there is no adversary involved. To perform the experiments, we used all 100k prompts in the RealToxicityPrompts dataset and reported the number of toxic generations before and after applying our defense to the GPT-2 model. Results in Figure 4.13 demonstrate that one iteration of our defense reduces the number of generated toxic responses by 81%, 31%, and 23%, according to Toxic-bert, Perspective API, and the Safety classifier, respectively. Although the defense is based on Toxic-bert, the results transfer to Perspective API and the Safety classifier. These results show the effectiveness of our defense in reducing toxic generations beyond the conversational domain, and a step toward reducing toxic generation more broadly.

Figure 4.13: Number of toxic response generations before and after our introduced defense was applied to GPT-2 on the RealToxicityPrompts dataset (Gehman et al., 2020). The defense is demonstrated to reduce the number of toxic generations in GPT-2. Results on Toxic-bert show the defense results, while results on Perspective API and Safety classifier establish the transferability of our defense.
Notice that, unlike the previous discussions, the setup of this experiment was not adversarial, and the prompts alone were causing toxic generations; thus, there was no adversary involved in this set of experiments.

4.2.5 Conclusion

For a responsible AI system design, it is not sufficient to ensure that these systems avoid learning existing unintentional biases or problematic patterns from the surrounding world; we must also ensure their robustness to various intentional phenomena that can trigger undesired behavior. These systems can be surrounded by different types of adversaries that can intentionally manipulate them for their own benefit, with negative societal consequences. It is our responsibility to ensure that our systems are robust to these types of adversarial and intentional behaviors by first analyzing existing vulnerabilities in these systems and then proposing solutions. Examples of such adversarial phenomena with societal consequences range from intentional manipulations of the data or the model that make the model's output unfair, all the way to manipulations that trigger toxic behavior or generations in various generative models, including the conversational agents that surround us in different applications.

Chapter 5
Interpretability as a Means to Improve Fairness

The incorporation of AI systems in consequential domains, such as healthcare and parole decision-making systems, has drawn intense scrutiny on the fairness of these methods. However, ensuring fairness on its own is often insufficient, as the rationale for a contentious decision needs to be audited, understood, and defended. Thus, it is important for AI systems to be not only fair but also interpretable. If a decision is found to be unfair, it is important to know why such an unfair decision was made and, moreover, how to combat it in an interpretable fashion. In light of this, we propose to borrow concepts from the interpretability and fairness domains and introduce frameworks in which we can detect features that contribute to unfair decisions, along with approaches to minimize the effect of the detected problematic features and obtain fairer outcomes.

We propose that the attention mechanism can be used to ensure fair outcomes while simultaneously providing feature attributions to account for how a decision was made. Toward this goal, we design an attention-based model that can be leveraged as an attribution framework to identify features responsible for both the performance and the fairness of the model. This framework detects problematic features through attention interventions and attention weight manipulation. Using this attribution framework, we design a post-processing bias mitigation strategy to minimize the effect of problematic features. We compare this approach with a suite of baselines. Finally, we demonstrate the versatility of our approach by conducting experiments on two distinct data types, tabular and textual.

5.1 Problematic Feature Identification

To identify problematic features that have a negative impact on model performance and fairness in an interpretable fashion, we propose to use a classification model that incorporates the attention mechanism over the input features (Mehrabi et al., 2022b; Zhou et al., 2016).
We focus on binary classification tasks and assume access to a dataset of triplets $\mathcal{D} = \{x_i, y_i, a_i\}_{i=1}^{N}$, where $x_i, y_i, a_i$ are i.i.d. samples from the data distribution $p(\mathbf{x}, \mathbf{y}, \mathbf{a})$. We denote the sensitive or protected attributes in our dataset as $\mathbf{a} \in \{a_1, \ldots, a_l\}$, a discrete variable with $l$ possible values. $\mathbf{y} \in \{0, 1\}$ denotes the true label. We denote the features in our dataset as $\mathbf{x} \in \mathbb{R}^m$, which may include sensitive attributes. We use $\hat{y}_o$ to denote the binary outcome of the original model. Finally, $\hat{y}_z^k$ represents the binary outcome of a model in which the attention weights corresponding to the $k$-th feature are zeroed out. This framework is flexible and can be used to find attributions with regards to different fairness notions (more particularly, group fairness measures like Statistical Parity (Dwork et al., 2012), Equalized Odds (Hardt et al., 2016a), and Equality of Opportunity (Hardt et al., 2016a)).

We consider each feature value as an individual entity and learn a fixed-size embedding $\{e_k\}_{k=1}^{m}$, $e_k \in \mathbb{R}^{d_e}$, for each feature $\{f_k\}_{k=1}^{m}$. These vectors are then passed to the attention layer, where the attention weights along with the final representation for a sample are obtained as follows:

$$H = \tanh(E); \quad \alpha = \mathrm{softmax}(w^{T} H); \quad r = \tanh(E \alpha^{T}),$$

where $E = [e_1 \ldots e_m]$, $E \in \mathbb{R}^{d_e \times m}$, is the concatenation of all the embeddings, $w \in \mathbb{R}^{d_e}$ is a learnable parameter, $r \in \mathbb{R}^{d_e}$ denotes the overall sample representation, and $\alpha \in \mathbb{R}^{m}$ denotes the attention weights. The resulting representation, $r$, is finally passed to the feed-forward layers for classification. The overall architecture is shown in Figure 5.1.

Figure 5.1: (a) In the general classification model, for each feature $f_k$ a vector representation $e_k$ of length $d_e$ is learned. These vector representations are passed to the attention layer, which produces a $d_e$-dimensional vector representation for the sample instance $i$. This sample instance representation is finally passed to two dense layers to get the final classification output. (b) The attribution framework has the same architecture as the general classification model. One outcome is obtained through the original model and another through the model that has some attention weights zeroed. The observed differences in accuracy and fairness measures indicate the effect of the zeroed-out features on accuracy and fairness.
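A minimal PyTorch sketch of this architecture is given below; dimensions and layer sizes are illustrative, features are assumed to be integer-coded, and the `attn_scale` argument anticipates the attention-weight manipulation used by the attribution and mitigation frameworks described next.

```python
# Attention-based classifier: one embedding per feature value, attention
# weights over feature embeddings, weighted representation r, then two
# dense layers for classification.
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    def __init__(self, feature_cardinalities, d_e=32, d_hidden=64):
        super().__init__()
        # one embedding table per (integer-coded) feature
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, d_e) for card in feature_cardinalities]
        )
        self.w = nn.Parameter(torch.randn(d_e))           # attention query vector
        self.classifier = nn.Sequential(
            nn.Linear(d_e, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 2)
        )

    def forward(self, x, attn_scale=None):
        # x: (batch, m) integer-coded feature values
        E = torch.stack([emb(x[:, k]) for k, emb in enumerate(self.embeddings)], dim=2)
        H = torch.tanh(E)                                  # (batch, d_e, m)
        alpha = torch.softmax(self.w @ H, dim=-1)          # (batch, m) attention weights
        if attn_scale is not None:                         # per-feature manipulation,
            alpha = alpha * attn_scale                     # e.g., zero out feature k
        r = torch.tanh(torch.einsum("bdm,bm->bd", E, alpha))
        return self.classifier(r), alpha
```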
In order to identify problematic features that contribute to unfair outcomes, we can utilize the introduced classification model in an attribution framework. The classification model with the attention mechanism combines input feature embeddings by taking a weighted combination; by manipulating the weights, we can intuitively capture the effects of specific features on the output. Thus, to identify features responsible for unfair outcomes, we observe the effect of each feature on the fairness of the outcomes by zeroing out or reducing its attention weights and recording the change. While some previous work has used similar ideas to understand the effect of attention weights on accuracy and evaluated the interpretability of the attention weights by comparing the difference in outcomes in terms of measures such as the Jensen-Shannon divergence (Serrano and Smith, 2019), no such work has been done for fairness. We are interested in the effect of features on fairness measures and in identifying problematic features responsible for unfair outcomes.

To this end, we measure the difference in the fairness of the outcomes based on a desired fairness measure. A large change in the fairness measure and a small change in the performance of the model indicate that the feature is mostly responsible for unfairness; thus, it can be dropped without causing a large impact on performance. The overall framework is shown in Figure 5.1. First, the outcomes are recorded with the original attention weights intact, using the introduced classification model (Figure 5.1a). Next, the attention weights corresponding to a particular feature are zeroed out using the attribution framework, and the difference in performance and fairness measures is recorded (Figure 5.1b). Based on the observed differences, one may conclude how incorporating this feature contributes to fairness/unfairness. To measure the effect of the $k$-th feature on different fairness measures, we measure the difference in the fairness of the outcomes of the original model and the model with the $k$-th feature's effect removed. For example, for statistical parity difference, we consider $\mathrm{SPD}(\hat{y}_o, a) - \mathrm{SPD}(\hat{y}_z^k, a)$. A negative value indicates that the $k$-th feature helps mitigate unfairness, and a positive value indicates that the $k$-th feature contributes to unfairness. This is because $\hat{y}_z^k$ captures the exclusion of the $k$-th feature (zeroed-out attention weight for that feature) from the decision-making process.

5.2 Bias Mitigation Strategy

The attribution framework can help identify features that contribute to unfair outcomes according to different fairness measures. As a result, a simple technique to mitigate or reduce bias is to reduce the attention weights of the features identified as contributing to unfairness by the attribution framework (Mehrabi et al., 2022b). This mitigation technique is outlined in Algorithm 4. In this algorithm, we first individually set the attention weights for each of the features in all the samples to zero and monitor the effect on the desired fairness measure. If the $k$-th feature is identified as contributing to unfairness, we reduce its attention weight using a decay rate.

This post-processing technique has various advantages. First, it is computationally efficient, as it does not require model retraining to ensure fairness for each sensitive attribute separately. Instead, the model is trained once, since the attention weight manipulation happens at test time according to particular needs and use cases. Second, this approach provides an explanation of, and control over, the fairness-accuracy trade-off: manipulating the attention weights reveals which features are important for obtaining the desired outcome and how important they are, which provides an explanation for the outcome as well as a mechanism to control the fairness-accuracy trade-off through the amount of manipulation.

Algorithm 4: Bias Mitigation by Attention
Input: decay rate $d_r$ ($0 \leq d_r < 1$), $n$ test samples indexed by variable $i$.
Output: final predictions, unfair features.
Calculate the attention weights $\alpha_{ki}$ for the $k$-th feature in sample $i$ using the attention layer.
unfair_feature_set = {}
for each feature (index) $k$ do
    if $\mathrm{SPD}(\hat{y}_o, a) - \mathrm{SPD}(\hat{y}_z^k, a) \geq 0$ then
        unfair_feature_set = unfair_feature_set $\cup$ {$k$}
    end
end
for each feature (index) $k$ do
    if $k$ in unfair_feature_set then
        Set $\alpha_{ki} \leftarrow (d_r \times \alpha_{ki})$ for all $n$ samples
    end
end
Use the new attention weights to obtain the final predictions $\hat{Y}$.
return $\hat{Y}$, unfair_feature_set
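A compact sketch of Algorithm 4 on top of the model above is given below; `spd` is a simple statistical parity difference estimate for a binary sensitive attribute, and the decay rate and helper names are illustrative.

```python
# Sketch of attention-based bias mitigation (Algorithm 4) using the
# AttentionClassifier defined earlier.
import torch

def spd(y_hat, a):
    # |P(y_hat=1 | a=1) - P(y_hat=1 | a=0)| for a binary sensitive attribute a
    return abs(y_hat[a == 1].float().mean() - y_hat[a == 0].float().mean()).item()

def predict(model, x, scale=None):
    logits, _ = model(x, attn_scale=scale)
    return logits.argmax(dim=-1)

def mitigate(model, x, a, decay_rate=0.0):
    m = x.shape[1]
    base_spd = spd(predict(model, x), a)
    unfair = []
    for k in range(m):                      # zero out feature k and re-measure SPD
        scale = torch.ones(m)
        scale[k] = 0.0
        if base_spd - spd(predict(model, x, scale), a) >= 0:
            unfair.append(k)
    final_scale = torch.ones(m)
    final_scale[unfair] = decay_rate        # decay attention of unfair features
    return predict(model, x, final_scale), unfair
```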
5.3 Experiments and Results

We perform various experiments to show the validity of the introduced attribution framework in identifying problematic features, as well as the ability of the bias mitigation strategy to reduce bias. In addition, we show the applicability of the introduced approach to textual data.

5.3.1 Validating Attribution Framework

We first demonstrate that our attribution framework captures correct attributions of features to the fairness of the outcomes. To this end, we create two synthetic datasets in which we control how features interact with each other and contribute to the accuracy and fairness of the outcome variable. These synthetic datasets capture common scenarios in fair classification problems, such as data imbalance and indirect discrimination.

Scenario 1: First, we create a simple scenario to demonstrate that our framework identifies correct feature attributions for fairness and accuracy. We create a feature that is correlated with the outcome (responsible for accuracy), a discrete feature that causes the prediction outcomes to be biased (responsible for fairness), and a continuous feature that is independent of the label or the task (irrelevant for the task). If the attention-based attribution framework works correctly, we expect to see a reduction in accuracy upon removing the effect of the feature responsible for accuracy, a reduction in bias upon removing the feature responsible for bias, and very little or no change upon removing the irrelevant feature. With this objective, we generate a synthetic dataset with three features, i.e., $x = [f_1, f_2, f_3]$, as follows (we use $x \sim \mathrm{Ber}(p)$ to denote that $x$ is a Bernoulli random variable with $P(x = 1) = p$):

$f_1 \sim \mathrm{Ber}(0.9)$, $f_2 \sim \mathrm{Ber}(0.5)$, $f_3 \sim \mathcal{N}(0, 1)$, and $y \sim \mathrm{Ber}(0.9)$ if $f_2 = 1$, $y \sim \mathrm{Ber}(0.1)$ if $f_2 = 0$.

In Scenario 1, $f_2$ has the most predictive information for the task and is responsible for accuracy. We consider $f_1$ as the sensitive attribute; it is an imbalanced feature that can bias the outcome and is generated such that there is no intentional correlation between $f_1$ and the outcome $y$ or $f_2$. $f_3$ is sampled from a normal distribution independent of the outcome $y$ and the other features, making it irrelevant for the task. Thus, an ideal classifier would be fair if it captures the correct outcome without being affected by the imbalance in $f_1$. However, due to limited data and the skew in $f_1$, there will be some undesired bias: a few errors when $f_1 = 0$ can lead to a large statistical parity difference.

Scenario 2: Sometimes features that are not identified as sensitive attributes can result in unfair decisions due to their implicit relations or correlations with the sensitive attributes. This phenomenon is called indirect discrimination (Zliobaite, 2015; Hajian and Domingo-Ferrer, 2013; Zhang et al., 2017). We designed this synthetic dataset to demonstrate and characterize the behavior of our framework under indirect discrimination. Similar to the previous scenario, we consider three features. Here, $f_1$ is considered the sensitive attribute, and $f_2$ is correlated with $f_1$ and the outcome $y$. The generative process is as follows:

$f_2 \sim \mathrm{Ber}(0.5)$, $f_3 \sim \mathcal{N}(0, 1)$, $f_1 \sim \mathrm{Ber}(0.9)$ if $f_2 = 1$ and $f_1 \sim \mathrm{Ber}(0.1)$ if $f_2 = 0$, and $y \sim \mathrm{Ber}(0.7)$ if $f_2 = 1$ and $y \sim \mathrm{Ber}(0.3)$ if $f_2 = 0$.

In this case, $f_1$ and $y$ are correlated with $f_2$. The model should mostly rely on $f_2$ for its decisions. However, due to the correlation between $f_1$ and $f_2$, we expect $f_2$ to affect both the accuracy and the fairness of the model. Thus, in this case, indirect discrimination is possible.
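Both generative processes can be sketched with NumPy as follows (the sample size is an illustrative assumption):

```python
# Sketch of the two synthetic scenarios described above.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Scenario 1: f2 drives the label, f1 is an imbalanced sensitive attribute
# independent of the task, f3 is irrelevant noise.
f1 = rng.binomial(1, 0.9, n)
f2 = rng.binomial(1, 0.5, n)
f3 = rng.normal(0, 1, n)
y  = rng.binomial(1, np.where(f2 == 1, 0.9, 0.1))

# Scenario 2: f2 drives the label AND is correlated with the sensitive f1,
# enabling indirect discrimination.
f2_s2 = rng.binomial(1, 0.5, n)
f1_s2 = rng.binomial(1, np.where(f2_s2 == 1, 0.9, 0.1))
f3_s2 = rng.normal(0, 1, n)
y_s2  = rng.binomial(1, np.where(f2_s2 == 1, 0.7, 0.3))
```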
Using such a synthetic dataset, we demonstrate a) indirect discrimination and b) the need for an attribution framework to reason about unfairness, rather than blindly focusing on the sensitive attributes for bias mitigation.

Results in Figure 5.2 demonstrate that in Scenario 1, as expected, $f_2$ is correctly attributed as being responsible for accuracy, and removing it hurts the accuracy drastically. Similarly, $f_1$ is correctly shown to be responsible for unfairness, and removing it creates a fairer outcome. Ideally, the model should not use any information about $f_1$, as it is independent of the task, but it does. Therefore, by removing $f_1$, we can ensure that this information is not used and hence the outcomes are fair. Lastly, as expected, $f_3$ is the irrelevant feature, and its effects on accuracy and fairness are negligible.

In Scenario 2, the attribution framework captures the effect of indirect discrimination. Removing $f_2$ reduces bias as well as accuracy drastically. This is because $f_2$ is the predictive feature, but due to its correlation with $f_1$, it can also indirectly affect the model's fairness. More interestingly, although $f_1$ is the sensitive feature, removing it does not play a drastic role in fairness or accuracy. This is an important finding, as it shows why removing $f_1$ on its own cannot give us a fairer model, due to the existence of correlations with other features and indirect discrimination. Overall, the results validate our assumption that the attention-based framework can provide reliable feature attributions for the fairness and accuracy of the model.

Figure 5.2: Results from the synthetic datasets analyzing the validity of the attribution framework in various scenarios (attribution visualizations of accuracy vs. SPD for synthetic Scenarios 1 and 2).

5.3.2 Validating Bias Mitigation Strategy

As highlighted earlier, understanding how the information within features interacts and contributes to the decision making can be used to design effective bias mitigation strategies. Oftentimes, real-world datasets have features which cause indirect discrimination, due to which fairness cannot be achieved by simply eliminating the sensitive feature from the decision process. Using the attributions derived from our attention-based attribution framework, we propose a post-processing mitigation strategy. We first identify the features responsible for the unfairness of the outcomes, i.e., all the features whose exclusion decreases the bias compared to the original model's outcomes, and gradually decrease their attention weights to zero, as outlined in Algorithm 4. We do this by first using the full attention weights learned and then gradually using a smaller fraction of the weights until they are completely zeroed out.

To validate whether our bias mitigation strategy is effective in reducing bias, we use various baselines and datasets.
For baselines, we consider methods that learn representations of the data such that information about sensitive attributes is eliminated. CVIB (Moyer et al., 2018a) realizes this objective through a conditional variational autoencoder, whereas MIFR (Song et al., 2019) uses a combination of an information bottleneck term and adversarial learning to optimize the fairness objective. FCRL (Gupta et al., 2021) optimizes information-theoretic objectives that can be used to achieve good trade-offs between fairness and accuracy by using specialized contrastive information estimators. In addition to the information-theoretic approaches, we also consider baselines that use adversarial learning, such as MaxEnt-ARL (Roy and Boddeti, 2019), LAFTR (Madras et al., 2018), and Adversarial Forgetting (Jaiswal et al., 2020). Note that, in contrast to the introduced approach, the baselines described are not interpretable, as they are incapable of directly attributing features to the fairness of the outcomes.

For all the baselines, we used the approach outlined in Gupta et al. (2021) for training a downstream classifier and evaluating the accuracy/fairness trade-offs. The downstream classifier was a 1-hidden-layer MLP with 50 neurons and a ReLU activation function. Each method was trained with five different seeds, and we report the average accuracy and the fairness measure, statistical parity difference (SPD). LAFTR can only deal with binary sensitive attributes and is thus not applicable to the Heritage Health dataset. We report the results on the UCI Adult and Heritage Health datasets.

Figure 5.3: Accuracy vs. parity curves for the UCI Adult and Heritage Health datasets.

Figure 5.3 compares the fairness-accuracy trade-offs of the different bias mitigation approaches. We desire outcomes that are fairer, i.e., lower values of SPD, and more accurate, i.e., towards the right. The results show that using attention attributions can indeed be beneficial for reducing bias. Moreover, the introduced mitigation framework based on the manipulation of the attention weights is competitive with state-of-the-art mitigation strategies. However, most of these approaches are specifically designed and optimized to achieve parity and do not provide any interpretability. The introduced model not only achieves comparable and competitive results, but it is also able to provide explanations such that users know exactly which features were manipulated, and by how much, to obtain the corresponding outcome. Another advantage of the introduced approach is that it needs only one round of training.
The adjustments to the attention weights are made post-training; thus, it is possible to achieve different trade-offs. Moreover, the introduced approach does not need to know the sensitive attributes during training; thus, it can work with sensitive attributes not known beforehand or during training.

5.3.3 Validating Applicability on Textual Data

In addition to providing interpretability, the introduced approach is flexible and useful for controlling fairness in modalities other than tabular datasets. To put this to the test, we applied our model to mitigate bias in text-based data. We consider the biosbias dataset (De-Arteaga et al., 2019) and use the mitigation technique to reduce the observed biases in the classification task performed on this dataset. We compare our approach with the debiasing technique proposed in the original paper (De-Arteaga et al., 2019), which works by masking the gender-related words and then training the model on this masked data. As discussed earlier, such a method is computationally inefficient: it requires re-training the model, or creating a new masked dataset, each time the model needs to be debiased against a different attribute, such as gender vs. race.

For the baseline pre-processing method, we masked the gender-related words, such as names and gendered words, as provided in the biosbias dataset, and trained the model on the filtered dataset. For our post-processing method, on the other hand, we trained the model on the raw bios and only manipulated the attention weights of the gender words (as provided in the biosbias dataset) during the testing process. To measure the bias, we used the same measure as in De-Arteaga et al. (2019), which is based on the equality of opportunity notion of fairness (Hardt et al., 2016a), and reported the True Positive Rate Difference (TPRD) for each occupation amongst different genders. As shown in Table 5.1, the introduced post-processing mitigation technique provides lower TPRD while being more accurate, followed by the technique that masks the gendered words before training. Although both methods reduce the bias compared to a model trained on raw bios without applying any mask or invariance to gendered words, our post-processing method is more effective.

Method | Dentist TPRD (stdev) | Nurse TPRD (stdev) | Accuracy (stdev)
Post-Processing (Ours) | 0.0202 (0.010) | 0.0251 (0.020) | 0.951 (0.013)
Pre-Processing | 0.0380 (0.016) | 0.0616 (0.025) | 0.946 (0.011)
Not Debiased Model | 0.0474 (0.025) | 0.1905 (0.059) | 0.958 (0.011)
Table 5.1: Difference of the True Positive Rates (TPRD) amongst different genders for the dentist and nurse occupations on the biosbias dataset. Our introduced post-processing method is the most effective in reducing the disparity for both occupations, compared to the pre-processing technique.

Figure 5.4 also highlights qualitative differences between the models in terms of their most attentive features for the prediction task. As shown in the results, the introduced post-processing technique is able to use more meaningful words, such as R.N. (registered nurse), to predict the outcome label "nurse" compared to both baselines, while the non-debiased model focuses on gendered words.

Figure 5.4: Qualitative results from the non-tabular data experiment on the job classification task based on bio texts. Green regions are the top three words used by the model for its prediction based on the attention weights. While the Not Debiased Model mostly focuses on gendered words, our method focuses on profession-based words, such as R.N. (Registered Nurse), to correctly predict "nurse."

5.4 Conclusion

Although mitigating bias in AI systems is desired, it is also important to design approaches that are interpretable. Not only do the designed approaches need to be interpretable, but the reason why bias exists in the first place also needs to be transparent. In light of this, we discussed how attention weights contribute to the fairness and accuracy of a predictive model.
We introduced an attribution method that leverages the attention mechanism and showed the effectiveness of this approach on both tabular and text data. Using this interpretable attribution framework, we then introduced a post-processing bias mitigation strategy based on attention weight manipulation. We validated the proposed framework by conducting experiments with different baselines, fairness metrics, and data modalities.

Chapter 6
Resolving Ambiguity through Human Feedback

Although it is necessary for humans to understand how models work, interpretability is not a sufficient criterion in designing a responsible AI system. As much as it is important for humans to understand how models make decisions, it is important for models to also accurately understand human intention. There are various reasons why AI systems may not accurately reflect human intention; one of them is when humans use ambiguous or under-specified language. Ambiguities can pose various issues; thus, it is important to study them and their effects on machine outcomes.

We study the effect of ambiguity in text-to-image generative models (Ramesh et al., 2021, 2022; Saharia et al., 2022; Yu et al., 2022) and demonstrate that ambiguous prompts provided to such models might result in undesired outcomes and a poor user experience. In particular, ambiguities due to underspecification can lead to biased outcomes with possible implications for the fairness of the underlying models (e.g., when prompted with "doctor talking to a nurse", the model might generate images with a disproportionate number of male doctors and female nurses). We also propose a framework for mitigating ambiguities existing in prompts. Our proposed framework disambiguates the initial ambiguous prompt, drawn from our curated benchmark dataset, through clarifying signals obtained via human-AI interaction. These signals are obtained by utilizing a language model that engages with the human-agent. Specifically, the framework either (1) generates clarifying questions for a human-agent to provide clarifying answers; or (2) generates different possible setups that can disambiguate the prompt, allowing the human-agent to pick the appropriate setup that matches the human intention. The framework is illustrated in Figure 6.1.

6.1 Benchmark Dataset

The original set of ambiguous prompts, for which we intend to get a resolution, comes from a benchmark dataset. The curated benchmark dataset is an extended and modified version of the LAVA corpus (Berzak et al., 2015). The original LAVA corpus contains various types of ambiguous sentences whose ambiguities can be visually detected and inspected, along with their corresponding images/videos. We use the ambiguous prompts (templates) from LAVA and not the images, as images in our case are generated automatically by text-to-image generative models.
Figure 6.1: Our proposed disambiguation framework, with two modes: generating clarifying questions and generating different visual setups. For example, for the initial prompt "An elephant and a bird flying" with the human's intention "The elephant is flying", the language model either asks the clarifying question "Is the elephant flying?" and receives the answer "Yes, the elephant is flying", or lists the possible setups ("1. The elephant is flying. 2. The elephant is not flying.") and the human chooses setup 1; in both cases the disambiguated prompt becomes "An elephant and a bird flying. The elephant is flying." The initial ambiguous prompt is thus disambiguated by either (1) the language model generating a clarifying question, which is resolved through a human-provided answer, or (2) the language model generating different possible visual setups and the human-agent choosing the desired setup. The final disambiguated prompt is later provided to the downstream text-to-image generative model.

The original LAVA corpus covers 237 ambiguous sentences (prompts) and 498 visual setups (possible interpretations for each ambiguous sentence). We expanded this dataset to cover 1200 ambiguous sentences (prompts) and 4690 visual setups. In addition to expanding LAVA, we made various modifications to the LAVA dataset, such as (i) diversifying the LAVA corpus to cover different objects, scenes, and scenarios, (ii) removing examples relevant to the video domain and keeping only examples relevant to static images, (iii) adding fairness prompts that cover different activities (Zhao et al., 2017) and occupations (Nadeem et al., 2021) in which the identities of the individuals are ambiguous, (iv) adding more structurally complex sentences, and (v) adding additional labels to our dataset (e.g., whether the visual setup or interpretation of an ambiguous sentence is commonsensical or not).

On a high level, the benchmark dataset covers six main types of prompt ambiguities, including fairness and linguistic type ambiguities. We also added some additional complex cases: we took a sample of prompts from the original benchmark and manually created a structurally more complex version of each sentence. This process was done such that the ambiguities, as well as the meaning of the sentence, were kept intact, while the structure of the sentence was made more complex through the addition of more information, extra words, adverbs, and adjectives. We also added some additional miscellaneous cases, which were not covered by the six main types of ambiguities, as well as combination cases in which we combined fairness and linguistic type ambiguities and made new variations of our existing prompts. Additional details can be found in Table 6.1.

Category | Ambiguity Type | Count
Main (Linguistic) | Syntax Prepositional Phrase (PP) | 74
Main (Linguistic) | Syntax Verb Phrase (VP) | 243
Main (Linguistic) | Syntax Conjunction | 127
Main (Linguistic) | Discourse Anaphora | 21
Main (Linguistic) | Discourse Ellipsis | 45
Main (Fairness) | Fairness | 355
Additional | Complex + combination + misc | 335
Table 6.1: Breakdown of our benchmark dataset by ambiguity type. The benchmark consists of six main types of ambiguities, including linguistic and fairness; for the linguistic type, we cover syntactic as well as discourse ambiguities. It also contains complex versions of a subset of the samples from the main ambiguity types with structurally more complex sentences, combination cases that combine fairness and linguistic type ambiguities, and some miscellaneous cases.

6.2 Disambiguation Framework

After obtaining the initial ambiguous prompts from our benchmark dataset, we utilize the capabilities of different language models to resolve existing ambiguities and obtain external disambiguation signals through human interaction using few-shot learning techniques, as discussed below.

6.2.1 Method

In our disambiguation framework, the language model is provided with a few examples of either (1) how to generate clarifying questions given some ambiguous prompts; or (2) how to generate different possible visual setups given an ambiguous visual-scene description. After being presented with these examples, the model tries to generalize this ability to other examples at inference time. In our experiments, we provided each of the three language models that we tested (GPT-2 (Radford et al., 2019), GPT-neo (Black et al., 2021), OPT (Zhang et al., 2022)) with one example from each of the six main types of ambiguities in our benchmark dataset. We then report the ability of each language model to generate either a clarifying question per ambiguous prompt or multiple possible visual interpretations per prompt. We consider each of these abilities as a separate setup under which a language model obtains external signals through human interaction.
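As an illustration, a few-shot clarifying-question setup along these lines can be sketched with the HuggingFace text-generation pipeline; the in-context examples and the specific GPT-neo checkpoint below are illustrative stand-ins for the one-example-per-ambiguity-type prompt built from the benchmark.

```python
# Sketch of few-shot clarifying-question generation with GPT-neo.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

FEW_SHOT = (
    "Prompt: An elephant and a bird flying.\n"
    "Clarifying question: Is the elephant flying?\n\n"
    "Prompt: An image of a doctor talking to a nurse.\n"
    "Clarifying question: What is the gender of the doctor?\n\n"
)

def clarifying_question(ambiguous_prompt: str) -> str:
    prompt = FEW_SHOT + f"Prompt: {ambiguous_prompt}\nClarifying question:"
    out = generator(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"]
    # keep only the newly generated question (first line after the prompt)
    return out[len(prompt):].strip().split("\n")[0]

# e.g., clarifying_question("A man sees a dog with a telescope.")
```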
6.2.2 Experiments

To validate the ability of language models to generate clarifying questions, we compare the generations from the language model to the ground truths in our benchmark dataset using the BLEU and ROUGE automatic metrics.

Ambiguity Type | GPT-2 BLEU↑ | GPT-2 ROUGE↑ | GPT-neo BLEU↑ | GPT-neo ROUGE↑ | OPT BLEU↑ | OPT ROUGE↑
Total Benchmark | 0.39 | 0.58 | 0.46 | 0.60 | 0.42 | 0.59
Syntax Prepositional Phrase (PP) | 0.21 | 0.64 | 0.06 | 0.63 | 0.22 | 0.65
Syntax Verb Phrase (VP) | 0.60 | 0.81 | 0.75 | 0.84 | 0.67 | 0.83
Syntax Conjunction | 0.17 | 0.63 | 0.23 | 0.65 | 0.06 | 0.56
Discourse Anaphora | 0.30 | 0.69 | 0.19 | 0.60 | 0.74 | 0.83
Discourse Ellipsis | 0.48 | 0.69 | 0.22 | 0.47 | 0.55 | 0.75
Fairness | 0.36 | 0.55 | 0.60 | 0.59 | 0.50 | 0.58
Table 6.2: BLEU and ROUGE scores obtained by different LMs on generating a clarifying question in the 6-shot setup given an ambiguous prompt. ↑ indicates that higher values are desired. Scores are reported on a 0-1 scale.

In addition, to evaluate the validity of the automatic metrics and their alignment with human results, we perform human interaction experiments with the language model, in which the human-agent provides disambiguation signals to the language model where appropriate. In this case, a response (disambiguation signal) is provided to the language model if the generated clarifying question is helpful in disambiguating the prompt according to the provided ground truth interpretation of the prompt from the benchmark dataset, and the question is left unanswered otherwise. These human interaction experiments serve two purposes: 1. We can measure how aligned the human-obtained results are with the automatic metrics used to evaluate the quality of generations by the utilized language models. 2. We obtain the disambiguation signals that are later needed for comparing images generated by the text-to-image generative models using the original ambiguous prompts vs. images generated from prompts that are disambiguated given external signals. We perform the same set of experiments for the case of directly generating multiple visual setups instead of clarifying questions.
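A sketch of the automatic comparison against the ground truth, using sacrebleu and ROUGE-L from the rouge-score package, is given below; the exact BLEU/ROUGE variants behind the reported numbers are not specified here, so treat this as one reasonable instantiation.

```python
# Compare a generated clarifying question against the benchmark ground truth.
import sacrebleu
from rouge_score import rouge_scorer

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def bleu_rouge(generated: str, reference: str) -> tuple[float, float]:
    bleu = sacrebleu.sentence_bleu(generated, [reference]).score / 100.0  # 0-1 scale
    rougeL = rouge.score(reference, generated)["rougeL"].fmeasure
    return bleu, rougeL

# e.g., bleu_rouge("Is the elephant flying?", "Is the elephant the one flying?")
```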
6.2.3 Results

We report the results for language models generating one clarifying question as well as multiple visual setups per given prompt. Results in Table 6.2 demonstrate that language models have a reasonable ability to generalize and generate good-quality clarifying questions compared to our ground truth when provided with an ambiguous prompt, according to the BLEU (~0.40) and ROUGE (~0.60) metrics. In addition to the overall results on our benchmark dataset, the fine-grained results for the six different ambiguity types, also reported in Table 6.2, suggest that there is disparity in how the different ambiguities are handled by each of these language models. For instance, language models obtain higher BLEU and ROUGE scores when generating clarifying questions for the Syntax Verb Phrase (VP) ambiguity type than for the Syntax Prepositional Phrase (PP) type. This suggests that some types of ambiguities are easier for the language models to resolve than others, even though the models see a similar number of examples per ambiguity type as few-shot examples at inference time.

Results from our human interaction experiments are shown in Figure 6.2. In these results, we report the percentage of generations by the GPT-neo language model that were deemed successful according to the human-agent interacting with this model. In other words, we report the percentage of generations for which an answer was provided by the human-agent rather than being left unanswered; an unanswered generation means that it was not helpful for disambiguating the prompt. We perform this experiment on the GPT-neo model, as we obtained the best automatic results for this model. From our human evaluation results, we report the agreement between the automatic metrics and the human evaluation results, with a Pearson correlation of 0.863 between the human and ROUGE scores and 0.546 between the human and BLEU scores. The corresponding results for the case in which the language models directly generate multiple visual setups are reported in Table 6.3 and Figure 6.3.

Ambiguity Type | GPT-2 BLEU | GPT-2 ROUGE | GPT-neo BLEU | GPT-neo ROUGE | OPT BLEU | OPT ROUGE
Total Benchmark | 0.23 | 0.52 | 0.20 | 0.44 | 0.31 | 0.60
Syntax Prepositional Phrase (PP) | 0.07 | 0.61 | 0.06 | 0.58 | 0.07 | 0.60
Syntax Verb Phrase (VP) | 0.39 | 0.80 | 0.30 | 0.69 | 0.39 | 0.81
Syntax Conjunction | 0.15 | 0.64 | 0.14 | 0.56 | 0.12 | 0.67
Discourse Anaphora | 0.0 | 0.57 | 0.06 | 0.47 | 0.0 | 0.76
Discourse Ellipsis | 0.0 | 0.58 | 0.14 | 0.60 | 0.20 | 0.76
Fairness | 0.29 | 0.50 | 0.19 | 0.41 | 0.40 | 0.60
Table 6.3: Automatic results from language models directly generating multiple visual setups.

Figure 6.2: Percentage of generations by GPT-neo that were successful according to the human-agent, and its comparison to the BLEU and ROUGE automatic metrics for the one-clarifying-question generation setup.

Figure 6.3: Percentage of generations by GPT-neo that were successful according to the human-agent, and its comparison to the BLEU and ROUGE automatic metrics for the multiple visual setup generations.

6.3 Evaluation Framework

To evaluate the effectiveness of the disambiguation framework in allowing text-to-image generative models to generate more faithful images aligned with human intention, we compared images generated by OpenAI's DALL-E model (Ramesh et al., 2022) as well as DALL-E Mega (Dayma et al., 2021) using the disambiguated prompts vs. the original ambiguous prompts.

6.3.1 Method

After obtaining the disambiguation signals through the language model's interaction with the human, we concatenated this external disambiguation signal to the original ambiguous prompt. To observe the effectiveness of this disambiguation framework in enabling text-to-image models to generate more faithful images aligned with human intention, we compared the generations given the original ambiguous prompts to those given the disambiguated prompts. For each prompt, four images were generated by the OpenAI DALL-E (Ramesh et al., 2022) and DALL-E Mega (Dayma et al., 2021) models.
6.3 Evaluation Framework

To evaluate the effectiveness of the disambiguation framework in allowing text-to-image generative models to generate more faithful images aligned with human intention, we compared images generated by OpenAI's DALL-E model (Ramesh et al., 2022) as well as DALL-E Mega (Dayma et al., 2021) from the disambiguated prompts vs. the original ambiguous prompts.

6.3.1 Method

After obtaining the disambiguation signals through the interaction between the language model and the human, we concatenated this external disambiguation signal to the original ambiguous prompt. To observe the effectiveness of this disambiguation framework in enabling text-to-image models to generate more faithful images aligned with human intention, we compared the generations given the original ambiguous prompts to those given the disambiguated prompts. For each prompt, four images were generated by the OpenAI DALL-E (Ramesh et al., 2022) and DALL-E Mega (Dayma et al., 2021) models.

After obtaining the generated images per prompt from each of these models, to automatically evaluate the faithfulness of the generations to human intention, as well as the effectiveness of our proposed framework, we used the ViLT Visual Question Answering (VQA) model (Kim et al., 2021). Our benchmark provides the human intention for each prompt in question format. We use both the image and its corresponding question as inputs to the VQA model, as shown in Figure 6.4, and then compare the results obtained with the initial ambiguous prompts and with the disambiguated prompts. Ideally, if the image aligns with human intention, we would expect the VQA model to answer "Yes" to the question. Thus, we report the percentage of times the VQA model outputs "Yes" as the percentage of faithful generations aligned with human intention among all generations.

Figure 6.4: The introduced automatic evaluation framework using a VQA model. An ambiguous prompt (e.g., "An elephant and a bird flying.") and its disambiguated version (e.g., "An elephant and a bird flying. The elephant is flying.") are each given to the text-to-image model, and the VQA model is asked the human-intention question ("Is the elephant flying?") about each generated image; a "Yes" answer matches human intention, while a "No" answer does not.

Finally, to evaluate how reliable the proposed automatic evaluation method is for assessing the faithfulness of generated images, we perform human evaluations. These serve two purposes. First, we want to evaluate whether our proposed automatic evaluation method using a VQA model is a reliable way to measure the faithfulness of images generated by text-to-image models. Second, we want to evaluate the effect of faithful generations, and ultimately the effectiveness of our framework and of disambiguating the prompts, using humans instead of automatic metrics. The human evaluation experiments were performed on the Amazon Mechanical Turk platform. Overall, 400 images were annotated by MTurk workers. Each image was annotated by three workers; thus, we obtained 1,200 annotations in total.

Lastly, we explore the effect that paraphrasing the disambiguated prompts can have on creating prompts that are better aligned with human intention and hence on more faithful image generation. Here, we take all the disambiguated prompts obtained through our framework, which are concatenations of the human-provided disambiguation signals and the ambiguous prompts, and apply a BART-based sentence paraphrasing model to them (Lewis et al., 2020a). We then compare the results of providing the text-to-image model the ambiguous prompt vs. the disambiguated prompt (obtained by simple concatenation of the human-provided signal to the original prompt) vs. the paraphrased version of the disambiguated prompt. We report whether this paraphrasing helps the model generate images that are more faithful to human intention.
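The VQA-based check described above can be implemented in a few lines with an off-the-shelf ViLT VQA checkpoint. The sketch below is illustrative: the image path and question are placeholders, and the checkpoint name is the publicly available ViLT VQA model rather than the exact configuration used in our experiments.

# Minimal sketch of the VQA-based faithfulness check.
# `generated_image.png` and the question are illustrative placeholders.
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open("generated_image.png")     # image produced by the text-to-image model
question = "Is the elephant flying?"          # human intention in question form

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
answer = model.config.id2label[logits.argmax(-1).item()]

# A generation is counted as faithful when the VQA answer is "yes".
print(answer, "-> faithful" if answer.lower() == "yes" else "-> not faithful")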
Figure 6.5: Percentages of faithful image generations by DALL-E Mega (ambiguous vs. disambiguated vs. paraphrased prompts) and OpenAI's DALL-E (ambiguous vs. disambiguated prompts), per ambiguity type and overall, according to automatic evaluation using a VQA model.

Figure 6.6: Percentage of faithful generations by DALL-E Mega and OpenAI's DALL-E, per ambiguity type and overall, from human evaluations.

6.3.2 Results

First, we demonstrate the effectiveness of the proposed disambiguation framework in generating faithful images aligned with human intention, according to human evaluations of OpenAI's DALL-E and DALL-E Mega. The overall bars in Figure 6.6, which aggregate all ambiguity types, show that disambiguation helps with faithful generation by improving over the random baseline (with the original ambiguous prompts and no human supervision, a model that generates a reasonable image can hit the correct interpretation about half of the time by chance). Although the results improve in aggregate, the fine-grained results in Figure 6.6 show that disambiguation can have an adverse effect on some ambiguity types (e.g., the PP type). In addition, we observe that for some ambiguity types (e.g., Ellipsis) it is hard to generate faithful images that even reach the random chance of hitting the correct interpretation, due to the complexity of the prompts in this category for text-to-image generative models. Typically, the Ellipsis category requires many entities to be composed in the image in complex ways that current text-to-image models are not yet capable of handling. These two observations can guide future work both toward improving text-to-image models on more complex cases and toward more effective methods for disambiguating the harder ambiguity types that our proposed method did not handle well. In addition, we report the annotator agreement according to Fleiss' kappa (Fleiss, 1971) to be 0.86, which indicates strong agreement.

Figure 6.7: Fairness qualitative examples from OpenAI's DALL-E (top row) and DALL-E Mega (bottom row). The ambiguous prompts "An image of a florist." and "An image of a diplomat." are contrasted with disambiguated prompts of the form "An image of a florist/diplomat. The florist/diplomat X.", where X is "is a male", "is a female", "is an old man", or "has a dark skin color".

Second, we show that the automatic evaluation method proposed in this work produces results similar to those of humans. The Pearson correlation between the human and automatic results is 0.83 for DALL-E Mega and 0.75 for OpenAI's DALL-E. This shows that the proposed automatic metric can agree with human annotators and can be used to avoid the cost and time that human evaluation imposes on this task.
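The annotator agreement reported above can be computed with the standard Fleiss' kappa implementation. A minimal sketch follows, assuming the three MTurk judgments per image have been collected as binary faithfulness labels; the small array is illustrative, not our actual annotation data.

# Minimal sketch of the inter-annotator agreement computation (Fleiss' kappa).
# The ratings array is illustrative; rows are images, columns are the 3 workers,
# and entries are 1 (faithful) or 0 (not faithful).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])
table, _ = aggregate_raters(ratings)   # per-image counts for each label
print("Fleiss' kappa:", fleiss_kappa(table))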
Figure 6.7 also demonstrates the effect that disambiguation has on generating more diverse images, with implications for fairness. By specifying the identity attributes associated with an individual rather than using general, ambiguous prompts, more diverse images can be generated. We believe that a language model that is aware of these ambiguities and gives the user the opportunity to specify their intention more clearly can improve user satisfaction and lead these models to generate more diverse images, with implications for fairness.

Lastly, we report the effect of paraphrasing the disambiguated prompts compared to simply concatenating the disambiguation signal to the end of the ambiguous prompts. Figure 6.5 shows that paraphrasing the disambiguated prompts yields only a very slight, non-significant improvement according to the automatic VQA evaluation on the DALL-E Mega model, and slightly larger improvements according to the human evaluations shown in Figure 6.6. We report these results only on DALL-E Mega: most results follow similar patterns for both models (DALL-E Mega and OpenAI's DALL-E), and, moreover, we had quota limitations accessing OpenAI's DALL-E model and thus were unable to generate images for the paraphrased prompts with it.

6.4 Conclusion

An important element of responsible AI system design is the ability of models to faithfully satisfy human intention. One scenario in which human intention may remain unsatisfied is when the prompt itself is ambiguous, making it hard for the model to generate outcomes that are faithful and aligned with that intention. In light of this, we studied the role of prompt ambiguity in text-to-image generative models and proposed a disambiguation framework that helps the model generate more faithful images better aligned with user intention. We then measured the ability of various language models to obtain disambiguating signals through human interaction, by either generating clarifying questions or directly generating multiple possible visual setups, utilizing concepts from few-shot learning. After obtaining the signals via the interaction between the language model and the human, we measured the faithfulness of the images generated by text-to-image models when given ambiguous, disambiguated, and paraphrased disambiguated prompts, performing various automatic as well as human evaluations to support our claims.

Chapter 7
Internalizing Ethics in AI Systems

Thus far we have talked about designing responsible AI systems considering various concepts from the responsible AI literature (e.g., robustness) and their intersections (e.g., fairness and interpretability); however, one question remains open. Can we design an ultimate AI system that internalizes ethics and reasons about the consequences of its assigned tasks or design choices on humans and their surroundings, toward an ultimately responsible AI design? Toward the goal of internalizing ethics in AI systems, we ask the following question: Can machines reason about the broader societal implications of their solutions? To help us answer this question, we introduce the Broader Impact Dataset (BID), which consists of technical papers (abstract, technical description, summary features extracted from the body), broader impact statements, and question/answer pairs about the broader and ethical implications of the work.
We define two interrelated tasks on BID: (1) generating broader impact statements about a research problem from the abstract and other features of the technical description of the work; and (2) answering a broader impact questionnaire using the broader impact statements. We benchmark several baselines on these two tasks and analyze their areas of strength and weakness. The constructed dataset and the defined tasks can help advance the internalization of ethics in AI systems.

7.1 Broader Impact Dataset (BID)

Toward the goal of internalizing ethics in AI systems, we curated a benchmark dataset called the Broader Impact Dataset (BID). BID consists of two interrelated datasets: (i) BID-Gen, a dataset of research papers from different Artificial Intelligence (AI), Natural Language Processing (NLP), and Machine Learning (ML) venues that contain ethical or broader impact discussions. Our goal is to train generative AI models on this dataset that could reason in a way similar to researchers reasoning about the ethical implications of their proposed research. (ii) BID-QA, a dataset of question-answer pairs in which, given an ethics-related question coming from checklists previously developed by ethical AI researchers, human annotators provide appropriate answers according to a provided broader impact statement or hypothesis. Our goal in this task is to test whether AI systems can answer ethics-related questions in the same way as humans do, given a broader impact statement or hypothesis, tested in different scenarios. We describe each dataset in more detail below.

Split | Overall Papers | NeurIPS | EMNLP | AAAI | ACL | NAACL | ICWSM
Train | 2,000 | 1,390 | 192 | 178 | 135 | 91 | 14
Validation | 340 | 246 | 33 | 22 | 23 | 14 | 2
Test | 383 | 262 | 46 | 28 | 26 | 18 | 3
Total | 2,723 | 1,898 | 271 | 228 | 184 | 123 | 19
Table 7.1: Breakdown of our BID-Gen dataset containing research papers with broader impact statements into train, validation, and test splits.

7.1.1 BID-Gen: A Dataset for Generating Broader Impact Statements

To perform the task of generating an overall hypothesis about the ethical considerations of an idea, proposal, or task, we required research papers that discuss their ideas along with a section describing the broader impact and ethical considerations associated with the work. In light of this, we collected 2,723 research papers with broader impact statements from six different venues, as shown in Table 7.1. We converted these papers from PDF files to JSON files using Allen AI's S2ORC Doc2JSON (Lo et al., 2020), an open-source PDF processing utility. In addition to papers from the main conferences, we also collected workshop papers from the same venues that were accompanied by a broader or ethical impact section. Depending on the experiments, we used either the abstracts only, the whole document, or important information extracted from these papers to generate broader impact statements. The statistics of the train, validation, and test splits of this dataset are shown in Table 7.1.

7.1.2 BID-QA: A Dataset for Answering Ethical Questions

In addition to collecting research papers from various venues with a broader or ethical implications section, we collected a second set of data in which we utilized a list of seven questions designed and outlined previously as a guideline for researchers writing the ethical implications section (Mohammad, 2022; Madaio et al., 2020; Liu et al., 2022), with their corresponding answers extracted or inferred from the broader impact text.
The list of questions is outlined in Table 7.2.

# | Question
1 | What are the contributions of this work?
2 | In what applications can the research be used?
3 | What are the potential positive societal implications of these applications/work?
4 | What are the potential negative societal implications of these applications/work?
5 | Who may be at risk of experiencing harms from these systems, applications, and research?
6 | Could this use case or product exclude anyone from the audience based on irrelevant criteria? How?
7 | Are there any privacy, fairness, robustness, or other concerns with this work? Explain.
Table 7.2: Seven pre-defined questions used to curate the BID-QA dataset for Task 2.

Given a broader impact statement, we used Amazon Mechanical Turk (MTurk) along with the SurgeAI platform and asked the workers to answer the questions listed in Table 7.2. If the answer to a question could not be directly extracted or inferred from the broader impact text, we directed the workers to leave the question unanswered. To avoid over-fitting to the seven questions in Table 7.2 and restricting ourselves to this limited set, we also asked the workers to design, for each data point, one additional question whose answer can be extracted or inferred from the text, along with the corresponding answer. This additional data serves as one of our supplementary test cases in this work. The BID-QA dataset is used for the second proposed task. We collected 3,489 question-answer pairs from analyzing over 500 broader impact statements. We excluded questions for which an answer could not be extracted or directly inferred from the text. After collecting the overall BID-QA dataset from the workers over the various platforms, we divided it into different portions to perform various analyses and studies, as follows:

Standard Set from the BID-QA Dataset (BID-QA-extract). The first portion of the BID-QA dataset is the standard set, in which we only consider the question-answer pairs whose questions come from the pre-defined seven questions borrowed from previous work and not the ones designed by the workers. In addition, this set contains the question-answer pairs whose answers can be directly extracted from the given broader impact statement and do not require any type of reasoning or extra knowledge. We collected 1,397 question-answer pairs for this set and used it for training, validating, and testing our models using its corresponding splits. The statistics of the train, validation, and test splits of this set can be found in Table 7.3.

Hard Set from the BID-QA Dataset (BID-QA-reason). The second set from the BID-QA dataset (BID-QA-reason) contains question-answer pairs whose questions still come from the pre-defined seven questions but whose answers were not directly extracted from the text by the workers. For these questions, the workers incorporated some type of reasoning, extra knowledge, commonsense, or concatenation of information from different parts of the text to answer questions that were harder to answer directly from the text. This set is primarily used for testing purposes, verifying that the models can go beyond pure knowledge extraction from the text and exhibit some reasoning capability, which is considered a harder task than directly extracting the whole answer from the text.
The BID-QA-reason set contains 642 question-answer pairs and is primarily used as a harder test case in our work, in addition to the standard test set (BID-QA-extract) discussed above.

Additional Questions Set from the BID-QA Dataset (BID-QA-open). In addition to the BID-QA-extract and BID-QA-reason sets, which cover the seven pre-defined standard questions, we curated an additional set, BID-QA-open, which contains worker-designed questions outside of the provided seven questions. This set serves as an additional test set to examine whether models can generalize to answering various forms of questions rather than only the pre-defined seven questions provided to them. BID-QA-open contains 384 question-answer pairs and, similar to BID-QA-reason, is used as an additional test case.

Set Name | # QA Pairs
BID-QA-extract Train | 1,019
BID-QA-extract Val | 90
BID-QA-extract Test | 288
BID-QA-reason | 642
BID-QA-open | 384
BID-QA-expert | 519
BID-QA-non-expert | 547
Total (BID-QA) | 3,489
Table 7.3: Detailed statistics of the BID-QA data, including its different sets along with the train, validation, and test split statistics of the BID-QA-extract set.

Expert Set from the BID-QA Dataset (BID-QA-expert). We were also interested in assessing the impact of worker expertise on providing good-quality answers to the questions. To this end, we used the SurgeAI platform and recruited workers with AI expertise or background to answer the same pre-defined seven questions as before, using a new set of 100 papers with a broader impact statement. We collected 519 question-answer pairs from AI experts for this set and used it as an additional test set.

Non-expert Set from the BID-QA Dataset (BID-QA-non-expert). We then used the same SurgeAI platform, this time recruiting workers with no AI expertise, on the same 100 papers used to curate BID-QA-expert, and collected 547 question-answer pairs overall. We intended to test the differences between these two test sets and see on which one our models obtain higher results and thus resonate with most. Note that while on the MTurk platform we did not specify whether a worker needed AI expertise, on the SurgeAI platform we intentionally made this distinction due to the nature of the experiments we wanted to perform. Detailed statistics of the overall dataset along with its various splits and sets are shown in Table 7.3. Overall, we collected 3,489 question-answer pairs for the BID-QA dataset, including data collected from both the MTurk and SurgeAI platforms across the various sets.

7.2 Tasks

We propose two tasks on our curated datasets: (i) given a research statement or proposal, generate a hypothesis about the ethical considerations of the proposal, similar to what a broader impact or ethical statement section would look like in an academic research paper; (ii) given a hypothesis or ethical statement, answer questions related to the ethical considerations associated with the hypothesis and ultimately the proposal.

7.2.1 Generating Hypothesis (Task 1)

To generate hypotheses similar to broader impact statements given a research proposal, we propose different strategies using different models. Below, we describe each strategy along with the models used and the corresponding results.

Abstract to Broader Impact (Abs→BI). In this setup, the train, validation, and test sets use the abstracts as the input and the broader impact statements as the output.
Thus, we fine-tuned different models using the abstracts of the papers along with the corresponding broader impact and ethical statements, so that given an abstract the model can generate a broader impact statement. We fine-tuned BART (Lewis et al., 2020b; https://huggingface.co/facebook/bart-large-cnn), Distil-BART (Shleifer and Rush, 2020; https://huggingface.co/sshleifer/student_xsum_3_12), Longformer encoder-decoder (Beltagy et al., 2020; https://huggingface.co/allenai/led-base-16384), and T5 Base (Raffel et al., 2020; https://huggingface.co/t5-base) models on our dataset. For a fairer comparison, to put BART and Distil-BART on par with Longformer and T5, we used versions of BART and Distil-BART that had already been fine-tuned on the summarization task. We then used various scores to report the quality of the generated broader impact statements on the validation and test sets. We utilized the Hugging Face library for accessing all the pre-trained models, fine-tuning, and reporting metrics.

Abstract and Extracted Information to Broader Impact (Abs+→BI). In this setup, the train, validation, and test sets use the abstracts plus extra information extracted from the research papers as inputs and the broader impact sections as the outputs. Thus, in addition to the abstracts of the papers, we appended other important information extracted from the body of each paper to the abstract and fine-tuned different models to generate the broader impact statements. We used the approach introduced in (Sun et al., 2022) to extract micro-level features, including various pieces of information and metrics from the paper. These features include information about the experiments, such as sample sizes, effect sizes, p-values, experimental models, and important sentences from the paper. We utilized the same models as in Abs→BI and reported results in the same way.

Research Paper to Broader Impact (Paper→BI). In this setup, the train, validation, and test sets use the entire research paper as input, with the broader impact section as the output. Thus, we used the whole research paper text, except the broader impact section, to fine-tune the Longformer encoder-decoder model to generate broader impact statements. Note that since the other models used in the previous strategies (Abs→BI and Abs+→BI) are not capable of processing long texts such as full research papers as inputs, we only used the Longformer model, which can process long sequences, for this strategy. The rest of the experimental setup is the same as in the previous strategies, as are the scores reported on the validation and test splits.
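To make the fine-tuning setup concrete, the following Python sketch shows one way the Abs→BI strategy can be implemented with the Hugging Face Trainer API: abstracts are tokenized as encoder inputs, the broader impact statements as targets, and a summarization-pretrained BART checkpoint is fine-tuned on the pairs. The sketch is illustrative; the example records, hyperparameters, and output directory are assumptions rather than the exact configuration used in our experiments.

# Minimal sketch of Abs→BI fine-tuning with a summarization-pretrained BART.
# The example records and hyperparameters are illustrative.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# In practice these records come from the BID-Gen train split.
train_data = Dataset.from_dict({
    "abstract": ["We propose a model that ..."],
    "impact": ["This work may benefit ..., but it could also be misused for ..."],
})

def preprocess(batch):
    inputs = tokenizer(batch["abstract"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["impact"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="abs2bi-bart",          # illustrative output path
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=3e-5,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Inference: generate a broader impact statement from a new abstract.
enc = tokenizer("Abstract of a new paper ...", return_tensors="pt", truncation=True)
print(tokenizer.decode(model.generate(**enc, max_new_tokens=200)[0], skip_special_tokens=True))

The Abs+→BI variant differs only in that the extracted features are concatenated to the abstract before tokenization, and the Paper→BI variant feeds the full paper text to the Longformer encoder-decoder instead.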
Model | FT Strategy | Rouge-1↑ (Val/Test) | Rouge-2↑ (Val/Test) | Rouge-L↑ (Val/Test) | BLEU↑ (Val/Test) | METEOR↑ (Val/Test)
BART | N/A | 19.4/20.1 | 3.1/3.5 | 11.2/11.6 | 3.4/4.3 | 11.5/11.8
Distil-BART | N/A | 9.5/9.8 | 1.2/1.4 | 6.9/7.2 | 0.0/0.0 | 4.7/4.9
Longformer | N/A | 27.6/27.4 | 4.7/4.9 | 13.4/13.2 | 24.9/23.4 | 21.7/22.1
T5 | N/A | 16.0/16.1 | 2.6/2.5 | 9.9/9.8 | 1.2/1.6 | 9.2/9.2
BART | Abs→BI | 31.1/30.8 | 5.9/5.6 | 14.0/14.1 | 27.4/27.6 | 23.6/23.7
Distil-BART | Abs→BI | 31.6/31.9 | 6.1/6.2 | 14.4/14.8 | 23.6/25.3 | 22.8/23.0
Longformer | Abs→BI | 29.7/29.6 | 5.4/5.3 | 14.0/13.9 | 21.8/23.5 | 21.5/21.7
T5 | Abs→BI | 28.7/28.4 | 5.4/5.1 | 14.3/14.2 | 14.7/15.9 | 18.9/18.6
BART | Abs+→BI | 31.3/31.8 | 7.1/7.4 | 15.1/15.4 | 27.5/29.4 | 23.7/24.9
Distil-BART | Abs+→BI | 32.1/32.1 | 7.5/7.4 | 15.2/15.4 | 26.3/27.0 | 24.0/24.2
Longformer | Abs+→BI | 30.6/30.4 | 6.9/6.7 | 15.2/15.1 | 23.9/24.9 | 22.8/22.7
T5 | Abs+→BI | 28.4/28.4 | 6.1/6.1 | 14.7/14.8 | 15.0/16.1 | 18.8/19.1
Longformer | Paper→BI | 29.4/28.9 | 5.5/5.4 | 13.0/12.9 | 23.2/21.9 | 25.3/25.9
Table 7.4: Results in terms of different scores (validation/test) on the task of generating hypotheses or broader impact statements (Task 1) for different language models and strategies.

In addition to the above strategies, we report results with no fine-tuning on our curated datasets. In this setup (marked as N/A in the results), we provided the models with the abstracts of the papers at inference time and generated results without any fine-tuning on our data. These experiments serve as baselines in which the models are not fine-tuned on our datasets.

7.2.1.1 Results

Table 7.4 reports the results of generating hypotheses (Task 1) by different models using different strategies. From the results, we observe that fine-tuning models on our datasets drastically improves performance compared to no fine-tuning. Overall, fine-tuning on the abstracts plus the extracted information from the papers (Abs+→BI) gives the best results across all models and measures. While fine-tuning on the extra information plus the abstract (Abs+→BI) helps boost performance, we noticed that fine-tuning on the complete paper (Paper→BI) can be less effective at generating better hypotheses. This might be due to the fact that working with longer text is intrinsically more challenging.

7.2.2 Answering Ethical Questions (Task 2)

We now discuss our experiments and results for Task 2, in which, given a question about the hypothesis or about the ethical and broader impact implications, the task is to provide reasonable answers.
We first discuss the general experiments performed on the BID-QA-extract set of our curated dataset (BID-QA) and then discuss more detailed experiments that showcase the generalization ability of the models on harder tasks using our additional curated test sets (e.g., the BID-QA-reason and BID-QA-open sets).

Model | Fine-tuned on | Rouge-1↑ (Val/Test) | Rouge-2↑ (Val/Test) | Rouge-L↑ (Val/Test) | BLEU↑ (Val/Test) | METEOR↑ (Val/Test)
BART | SQuAD2.0 | 19.8/17.9 | 13.3/13.3 | 18.0/16.8 | 21.8/13.3 | 18.7/15.3
Distil-BART | SQuAD2.0 | 26.2/18.5 | 20.6/14.14 | 24.6/17.3 | 27.3/12.6 | 26.3/16.4
Longformer | SQuAD2.0 | 18.2/20.6 | 12.6/16.3 | 17.1/20.0 | 8.3/8.7 | 14.2/16.3
T5 | SQuAD2.0 | 2.2/4.8 | 1.3/3.6 | 2.2/4.8 | 0.0/0.0 | 3.1/5.3
BART | SQuAD2.0+ours | 44.5/41.7 | 36.5/34.8 | 41.7/39.3 | 32.3/32.4 | 46.0/44.7
Distil-BART | SQuAD2.0+ours | 46.1/40.8 | 38.8/33.9 | 43.3/37.9 | 30.5/30.7 | 49.3/43.9
Longformer | SQuAD2.0+ours | 33.9/30.0 | 28.3/25.4 | 31.8/28.2 | 25.5/28.1 | 37.3/34.2
T5 | SQuAD2.0+ours | 34.6/33.1 | 27.3/25.7 | 30.9/29.4 | 23.2/22.5 | 50.2/47.4
Table 7.5: Results (validation/test) on the BID-QA-extract question answering sets according to different metrics over various models. These results show that scores improve drastically when the models are further fine-tuned on our BID-QA-extract train set compared to fine-tuning only on the SQuAD2.0 dataset.

We conduct two main types of experiments on the BID-QA dataset curated for Task 2. In the first set of experiments, we use various models and report the results on the BID-QA-extract set, comparing different fine-tuning strategies. In the second set of experiments, we use the additional test sets collected for Task 2, in addition to the BID-QA-extract set, to showcase different abilities, including the generalization ability of the models in generating reasonable answers per given question in different setups.

7.2.2.1 Standard Experiments

In this set of experiments, we used different models, similar to Task 1, fine-tuned them on the SQuAD2.0 (Rajpurkar et al., 2018) dataset, and compared the results to further fine-tuning the models on our curated dataset (BID-QA) for the question answering task proposed in this paper (Task 2) using the BID-QA-extract set. For this set of experiments, similar to Task 1, we used the BART (Lewis et al., 2020b), Distil-BART (Shleifer and Rush, 2020), Longformer encoder-decoder (Beltagy et al., 2020), and T5 Base (Raffel et al., 2020) models. We also report the same set of scores in a similar manner as for Task 1. For fine-tuning the models beyond the SQuAD2.0 dataset, we utilized the train split of BID-QA-extract introduced in this paper and reported the results on the test and validation splits of BID-QA-extract.
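As a concrete illustration of how these QA examples can be fed to the same sequence-to-sequence models, the sketch below packs a question and its broader impact statement into a single encoder input and uses the human-provided answer as the target; training then proceeds as in the Abs→BI sketch shown earlier (a Seq2SeqTrainer over the tokenized pairs). The texts and the "question:/context:" packing template are illustrative assumptions, not the exact templates used in our experiments.

# Minimal sketch: formatting one BID-QA example for seq2seq QA fine-tuning.
# The texts and the packing template are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

question = "What are the potential negative societal implications of these applications/work?"
impact_statement = "Our system could be misused to generate misleading content at scale ..."
answer = "It could be misused to generate misleading content at scale."

source = "question: " + question + " context: " + impact_statement
features = tokenizer(source, max_length=512, truncation=True)
features["labels"] = tokenizer(text_target=answer, max_length=128, truncation=True)["input_ids"]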
7.2.2.2 Additional Experiments

In addition to BID-QA-extract, we utilized the BID-QA-reason, BID-QA-open, BID-QA-expert, and BID-QA-non-expert datasets as additional test cases for the various models used in the previous experiments. The main goal is to analyze the generalization ability of these models on more challenging cases, in which either the answers are harder to extract directly from the text or the questions are different from those seen during training. We also want to compare results on the expert (BID-QA-expert) vs. non-expert (BID-QA-non-expert) sets and explore whether the models correlate better with, and produce results more similar to, a specific group (expert vs. non-expert).

Figure 7.1: Results from the various models (BART, Distil-BART, Longformer, T5) fine-tuned on SQuAD2.0 and our data, on the different Task 2 test sets (BID-QA-extract test, BID-QA-reason, BID-QA-open, BID-QA-expert, BID-QA-non-expert), in terms of ROUGE-1, ROUGE-2, ROUGE-L, BLEU, and METEOR.

7.2.2.3 Results on Standard Experiments

Table 7.5 shows the results of generating answers given an ethics-related question along with a broader impact (hypothesis) statement. These results demonstrate that further fine-tuning the various models on our BID-QA-extract train set drastically improves the outcome according to various scores, compared to fine-tuning these models only on the SQuAD2.0 question answering dataset. The reported numbers are the validation and test results on the BID-QA-extract set.

7.2.2.4 Results on Additional Experiments

We report the results of the various models fine-tuned on SQuAD2.0 and further fine-tuned on our dataset (the BID-QA-extract train set) on the different test sets from Task 2 in Figure 7.1. The results in Figure 7.1 show that although the models have reasonable performance and generalization ability on the BID-QA-open test set, which contains external questions designed by the workers, they do not, as expected, perform as well as when tested on the pre-defined seven questions seen during training. We also show that the models obtained lower scores on questions that required some reasoning (BID-QA-reason). Finally, we report that although manual analysis of the expert vs. non-expert datasets showed noticeably higher data quality in the expert set, the experimental results were close and not significantly different between the expert and non-expert sets.
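The scores reported in Tables 7.4 and 7.5 and in Figure 7.1 are standard reference-based metrics, and the sketch below shows how they can be computed over each test set with the Hugging Face evaluate library. The tiny prediction and reference lists are illustrative placeholders for the model outputs and the annotator-provided answers.

# Minimal sketch of the ROUGE/BLEU/METEOR evaluation over multiple test sets.
# The predictions and references below are illustrative placeholders.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

test_sets = {
    "BID-QA-extract (test)": (["generated answer ..."], ["reference answer ..."]),
    "BID-QA-reason": (["generated answer ..."], ["reference answer ..."]),
}
for name, (preds, refs) in test_sets.items():
    scores = dict(rouge.compute(predictions=preds, references=refs))
    scores["bleu"] = bleu.compute(predictions=preds, references=[[r] for r in refs])["bleu"]
    scores["meteor"] = meteor.compute(predictions=preds, references=refs)["meteor"]
    print(name, scores)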
7.3 Multitasking

In addition to considering the two tasks separately (Tasks 1 and 2), we aimed to achieve the goals of both tasks in a multitask setup and to study the effect on performance compared to fine-tuning on each task separately. To this end, we fine-tuned a T5 Base model once on each task separately and compared it to fine-tuning the T5 Base model on both tasks simultaneously. Note that for Task 2 we used the T5 Base model without pre-fine-tuning it on the SQuAD2.0 dataset and directly fine-tuned it on our task, in order to allow a fair comparison with the multitask setup. Results in Figure 7.2 demonstrate that the multitask setup achieves performance similar to that of separately fine-tuning for each task. Results for Task 1 (hypothesis generation) are shown in Figure 7.2 (left) and for Task 2 (answering ethics-related questions) in Figure 7.2 (right). We used BID-QA-extract for Task 2 in this batch of experiments. These results show that a model can be trained to perform two tasks at once in a multitask setup without a negative impact on performance.

Figure 7.2: Results comparing the multitask setup vs. separately fine-tuning for each task using the T5 model, in terms of ROUGE-1, ROUGE-2, ROUGE-L, BLEU, and METEOR. Results for generating hypotheses/broader impact statements (Task 1) are shown on the left; results for answering ethics-related questions using the BID-QA-extract set (Task 2) are shown on the right.

7.4 Conclusion

We proposed two interrelated tasks for automatically assessing the possible ethical implications of a research idea based on different inputs describing that idea. Toward this goal, we first constructed two datasets and then proposed different methods and strategies for addressing the tasks. Our results confirm the feasibility of automatically assessing the potential broader impacts of a research work based on its textual description. Although internalizing ethics in AI systems may be a challenging problem to solve, we have introduced datasets, tools, and tasks that can guide us a step closer to designing such systems. These types of ideas will be useful for an ultimately responsible AI system design. Such systems can also be used for debugging purposes. We believe that combining this work with commonsense reasoning can open horizons toward systems that can reason about the consequences of their actions.

Chapter 8
Related Work

We discuss related work in the context of the responsible AI literature to cover its various aspects.

8.1 Fairness

Research in fairness spans various directions, and its incorporation is becoming more widespread in different areas of AI. One main and important topic studied in fairness is coming up with appropriate fairness definitions and measures for a particular use case and application. Another important research area is to propose solutions to mitigate bias and achieve fairer outcomes.

8.1.1 Fairness Definitions

Binns (2018) studied fairness definitions in the political philosophy domain and tried to tie them to machine learning. Hutchinson and Mitchell (2019) studied the 50-year history of fairness definitions in the areas of education and machine learning. Verma and Rubin (2018) listed and explained some of the important fairness definitions. Saxena et al. (2019) studied the general public's perception of some of the existing fairness definitions in the computer science literature. Fairness definitions can also be categorized based on whether they are applicable to groups of people, individuals, or subgroups. In Individual Fairness, the paradigm is to give similar predictions to similar individuals (Kusner et al., 2017; Dwork et al., 2012). In Group Fairness, the paradigm is to treat different groups equally (Kusner et al., 2017; Dwork et al., 2012).
InSubgroupFairness the paradigm is to obtain the best properties of the group and individual notions of fairness. Subgroup Fairness picks a group fairness constraint like equalizing false positive rates and asks whether this constraint holds over a large collection of subgroups (Kearns et al., 2018, 2019). For instance, statistical parity (Kusner et al., 2017; Dwork et al., 2012), conditional statistical parity (Corbett-Davies et al., 2017), equalized odds (Hardt et al., 2016a), equality of opportunity (Hardt et al., 2016a), treatment equality (Berk et al.) and test fairness (Chouldechova, 2017) notions are group based notions. On the other hand, fairness through unawareness (Grgic- Hlaca et al., 2016; Kusner et al., 2017), fairness through awareness (Dwork et al., 2012), and counterfactual fairness (Kusner et al., 2017) are considered as individual fairness notions. 72 8.1.2 MethodsforFairness Methods for reducing biases and as a result improving fairness generally fall under three cat- egories: (1) pre-processing methods, (2) in-processing methods, (3) post-processing methods. Pre-processing techniques try to transform the data so that the underlying discrimination is re- moved (d’Alessandro et al., 2017). In-processing techniques try to modify the learning algorithms to remove discrimination during the model training process (d’Alessandro et al., 2017). Post- processing is performed after training by accessing a holdout set which was not involved during the training of the model (d’Alessandro et al., 2017). In post-processing techniques, the model is considered as a black-box and all the effort to make the outcome fair is applied post-training. For instance, Zafar et al. (2015a) proposed an in-processing classification approach to achieve fair outcomes, while Pleiss et al. (2017) proposed a post-processing classification approach. Proposing approaches to improve fairness and mitigate bias are considered in various AI domains, such as representation learning (Moyer et al., 2018b), word embedding (Zhao et al., 2018b), graph embedding (Bose and Hamilton, 2019), correference resolution (Zhao et al., 2018a), and machine translation (Font and Costa-jussà, 2019). There is also a large body of work considering fairness in classification since classification is a canonical task in machine learning and is widely used in different areas that can be in direct contact with humans. Therefore, certain methods have been proposed (Kamishima et al., 2012; Menon and Williamson, 2018; Goel et al., 2018; Krasanakis et al., 2018) that satisfy certain definitions of fairness in classification. Ustun et al. (2019) try to satisfy subgroup fairness in classification. Hardt et al. (2016b) propose solutions to satisfy equality of opportunity and equalized odds similar to work introduced in (Woodworth et al., 2017) which aims to satisfy equalized odds definition of fairness. Lastly, Zafar et al. (2015a); Aghaei et al. (2019) try to satisfy both disparate treatment and disparate impact. Other methods try to not only satisfy some fairness constraints but to also be stable toward change in the test set (Huang and Vishnoi, 2019). Wu et al. (2018) propose a general framework for learning fair classifiers. This framework can be used for formulating fairness-aware classification with fairness guarantees. In another work, Calders and Verwer (2010) propose three different modifications to the existing Naive Bayes classifier for discrimination-free classification. Oneto et al. 
(2019) take a new approach into fair classification by imposing fairness constraints into a Multitask Learning (MTL) framework. In addition to imposing fairness during training, this approach can benefit the minority groups by focusing on maximizing the average accuracy of each group as opposed to maximizing the accuracy as a whole without attention to accuracy across different groups. In a similar work, Dwork et al. (2018) propose a decoupled classification system where a separate classifier is learned for each group. They use transfer learning to reduce the issue of having less data for minority groups. Jiang et al. propose to achieve fair classification by mitigating the dependence of the classification outcome on the sensitive attributes by utilizing the Wasserstein distance measure. In addition, Kamiran and Calders (2010) propose the Preferential Sampling (PS) method to create a discrimination free train data set. They then learn a classifier on this discrimination free dataset to have a classifier with no discrimination. 73 8.2 Robustness Ensuring robustness of AI systems has been extensively studied in both NLP and ML domains. On the one side, some work try to highlight the existing vulnerabilities in these systems. On the other hand, some other work try to propose solutions to avoid such vulnerabilities. Analyzing robustness of AI systems is an important contribution towards responsible AI system design. 8.2.1 RobustnessinMachineLearning Research in robust ML is mostly focused on designing defenses and attacks against machine learning models (Steinhardt et al., 2017; Chakraborty et al., 2018; Li et al., 2018). Ultimately, the goal is for machine learning models to be robust toward malicious activities designed by adversaries. Thus, it is important to consider both sides of the spectrum in terms of designing the attacks and defenses that can overcome the attacks. In adversarial machine learning, different types of attacks, such as data poisoning and evasion attacks, exist. In evasion attacks, the goal is to come up with adversarial examples that are imperceptible to human eye but can deceive benign machine learning models during test time (Biggio et al., 2013; Moosavi-Dezfooli et al., 2016; Goodfellow et al., 2015). On the other hand, in data poisoning attacks, the goal is to manipulate the training data–via adding, removing, or changing instances–so that the learned model is malicious (Biggio et al., 2012; Shafahi et al., 2018). Different algorithms and approaches have been proposed for poisoning attacks focusing on accuracy as the performance measure (Biggio et al., 2012; Shafahi et al., 2018). Some other work consider the intersection of accuracy and fairness (Solans et al., 2020). 8.2.2 RobustnessinNaturalLanguageProcessing Crafting adversarial examples and using them in training was previously shown to be an effective technique in improving robustness of NLP and ML models (Nie et al., 2020; Dinan et al., 2019; Kiela et al., 2021). Not only that, but adversarial attacks can reveal important vulnerabilities in our systems (Zhang et al., 2020a). Although previous work has studied adversarial examples in NLP (Li et al., 2017; Zang et al., 2020; Morris et al., 2020; Mozes et al., 2021), most of them focused on accuracy as a metric of interest. A few other work studied toxicity and other ethical considerations (Wallace et al., 2019; Sheng et al., 2020). Cheng et al. 
(2019); Niu and Bansal (2018) studied adversarial attacks on conversational agents; however, their focus was on task oriented dialogue systems. Xu et al. (2020) also considered conversational domains and relied on human adversaries which can be costly and non-scalable. 8.3 Interpretability There is a body of work that tried to analyze the effect of the attention weights on interpretability of the model (Wiegreffe and Pinter, 2019; Jain and Wallace, 2019; Serrano and Smith, 2019). Other 74 work also utilized attention weights to define an attribution score to be able to reason about how transformer models such as BERT work (Hao et al., 2021). Notice that although Jain and Wallace (2019) claim that attention might not be explanation, a body of work has proved otherwise including (Wiegreffe and Pinter, 2019) in which authors directly target the work in Jain and Wallace (2019) and analyze in detail the problems associated with this study. In addition, Vig et al. (2020) analyze the effect of the attention weights in transformer models for bias analysis in language models. 8.4 Ethics There has been recent initiatives on requiring researchers to be more mindful of ethical implications of their research. In light of these initiatives, some conferences required authors to include an ethics statement in their papers. In addition, some researchers have investigated how to write better ethical statements or questions that one should be mindful when assessing ethical implications associated to a particular work (Mohammad, 2022; Madaio et al., 2020; Liu et al., 2022). Although it is important for researchers to be mindful of the implications associated with their research proposals, it is also important for AI systems to have basic realization of such considerations. The closest researchers have come to was to propose ideas on moral foundations theory (Ramezani et al., 2021; Kennedy et al., 2021). More attention needs to be paid on this aspect in ethics domain and its connections to responsible AI initiatives. 75 Chapter9 Discussion 9.1 Conclusion We have made the following progress in designing responsible AI systems: • In Chapter 3, we discussed how bias can affect AI systems in various stages of development starting from data, to algorithm design, to model outcomes. We first, analyzed and quantified existing biases in commonsense knowledge resources which are important data sources for models utilizing world knowledge and commonsense reasoning. We then showed how these biases can propagate to downstream models. We lastly proposed a solution based on pre-processing the data to mitigate the effect of existing representational harms or biases in these resources and their propagation to downstream models. In addition, we looked into algorithmic bias issues and how wrong algorithmic design choices can contribute to biases such as exclusion bias. We proposed new algorithmic solutions in community detection domain to mitigate the effect of exclusion bias in these algorithms. Finally, to show existing biases in model outcomes, we investigated NER systems and proposed measures to quantify existing biases in these systems. After performing analyses and investigations, we showcased the existence of gender bias in these systems. • In Chapter 4, we discussed and analyzed vulnerability of ML and NLP models to various intentional adversarial phenomena that can target societal and ethical concerns in these systems. 
We started this analysis by focusing on fairness as one of those concerns and demonstrated that ML models can be vulnerable to attacks targeting the fairness of the systems. We then studied generative models in the NLP domain and how adversarial agents can trigger toxic content generation in generative models, specifically conversational models. We proposed defense mechanisms that can be useful in reducing toxic content generation, applicable not only in the conversational domain but in any generation task, with or without the presence of an adversary, toward the goal of improving the robustness of these models.
• In Chapter 5, we proposed an approach to identify problematic features that can contribute to unfairness in an interpretable fashion. We then used this attribution framework and proposed a post-processing bias mitigation strategy to reduce the effect of the identified problematic features. Lastly, we empirically showed its effectiveness in reducing bias in both tabular and textual data modalities. This approach has various advantages, including being interpretable, efficient, and effective with various sensitive attributes as well as fairness metrics.
• In Chapter 6, we discussed the importance of prompt ambiguity and its resolution for better human-AI interaction. It is not only desirable for humans to understand the AI system; it is also important for the AI system to accurately understand human intention. These interactions should be bilateral for us to build more responsible AI systems. In light of this, we introduced a framework that incorporates humans in the loop to resolve existing ambiguities in text-to-image generative models. We also introduced evaluation frameworks that can measure the amount of faithful generations in these systems. We showcased how incorporating humans in the loop can improve faithful generations in these models by performing both automatic and human evaluations.
• In Chapter 7, we proposed a dataset and two associated tasks that would help models either generate ethical considerations related to a proposal or answer ethics-related questions given an ethical statement. We believe that this dataset and the associated tasks can be significantly helpful in designing ultimately ethical and responsible AI systems.

9.2 Future Directions

With the emergence of new technologies, we will always have the need to debug and audit these systems. There is still much work to be done in the responsible AI domain. In this dissertation, we have touched upon some of it, and here we provide some ideas about future directions and possible extensions to our work that can be useful in designing and auditing responsible AI systems. As discussed in Chapter 3, bias mitigation strategies can sometimes have adverse effects on the accuracy and quality of the models. Although in Chapter 3 we proposed an effective bias mitigation strategy to reduce the effect of representational harm or bias in knowledge bases, the approach resulted in model outputs with lower-quality generations. A future direction is to propose solutions with no degrading effects on the accuracy or quality of downstream models that use knowledge bases. We also demonstrated the existing gender bias in NER systems; however, it is crucial to propose bias mitigation strategies that can reduce the existing disparities between performance on male vs. female names. In Chapter 4, we analyzed the robustness of ML systems to two different proposed attacks that can target the fairness of the models.
We identified that ML models can be vulnerable towards such attacks. A future direction can be to propose defense mechanisms that can mitigate the effects of such intentional adversarial attacks on our systems. Moreover, the defense mechanism to reduce the amount of toxic content generation discussed in Chapter 4 is token level. It is possible that we lose context or grammar by completely masking these tokens. As a future work, one can investigate mitigating the effect of weights instead of complete masking depending on the importance of the 77 token. Lastly, introduced attacks and defenses can be used to generate adversarial data to train better quality dialogue models adversarially in the future. We also advocate for designing bias mitigation or fair learning algorithms that are interpretable. Most of the existing work do not consider these two aspects together to design better systems. Moreover, most of the concepts in responsible AI literature are considered separately and on their own. We encourage future research to look more into intersection of these concepts and try to incorporate most of the responsible AI literature concepts into their design strategies. Finally, to design an ultimate responsible AI system, we need systems that are more aware and can reason about ethical and societal consequences of the tasks or design strategies associated with them. We have introduced a small moving step towards designing such systems by curating a dataset and proposing tasks; however, more work needs to be done in order for us to have such aware systems. As a future work, researchers can look into intersection of this idea with existing work in commonsense reasoning and improve upon current existing AI systems. 78 Bibliography Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos. 2019. Learning optimal and fair decision trees for non-discriminative decision-making. InProceedingsoftheAAAIConferenceonArtificial Intelligence, volume 33, pages 1418–1426. Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Voll- graf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis- tics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics. URL: https://www.aclweb.org/anthology/N19-4010. Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649. Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document trans- former. arXiv preprint arXiv:2004.05150. Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in crim- inal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533. Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. 2015. Do you see what I mean? visual resolution of linguistic ambiguities. In Proceedings of the 2015 Confer- ence on Empirical Methods in Natural Language Processing, pages 1477–1487, Lisbon, Portugal. Association for Computational Linguistics. URL:https://aclanthology.org/D15-1172. Sumithra Bhakthavatsalam, Chloe Anastasiades, and Peter Clark. 2020. Genericskb: A knowledge base of generic statements. arXiv preprint arXiv:2005.00660. 
Abstract
With the advancement of Artificial Intelligence (AI) and its omnipresent role in different applications, it is crucial to ensure that AI systems comply with responsible practices. Moreover, the environment in which AI systems learn and interact contains various external factors that might adversely affect their behavior, so these systems should be able to mitigate the potentially negative impacts of such factors. This dissertation explores several dimensions that are essential for designing responsible AI systems. First, we focus on fairness as a central concept for responsible AI and analyze existing biases in various data sources and models; we also describe an interpretability-based framework for generating fair and equitable outcomes. Second, we discuss robustness to external perturbations as another important property of such systems. Next, we consider human-centered AI systems that take natural language prompts as input, demonstrate issues that arise from ambiguous interpretations of those prompts, and describe a framework for resolving such ambiguities and generating outcomes faithful to human intention. Finally, we discuss ideas for designing AI systems that can internalize ethics and recognize the consequences of the tasks and design choices associated with them. We hope that the contributions presented in this dissertation move us closer to more responsible AI systems.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Robust causal inference with machine learning on observational data
Artificial intelligence for low resource communities: Influence maximization in an uncertain world
Automatic evaluation of open-domain dialogue systems
Explainable AI architecture for automatic diagnosis of melanoma using skin lesion photographs
Optimizing execution of in situ workflows
Simulation and machine learning at exascale
Event-centric reasoning with neuro-symbolic networks and knowledge incorporation
Interaction between Artificial Intelligence Systems and Primate Brains
Deciphering protein-nucleic acid interactions with artificial intelligence
Measuring and mitigating exposure bias in online social networks
Multi-modal preconditioned inference of commonsense knowledge
Building generalizable language models for code processing
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
Predicting and modeling human behavioral changes using digital traces
Automatic decipherment of historical manuscripts
Probabilistic framework for mining knowledge from georeferenced social annotation
Computational models for multidimensional annotations of affect
Information geometry of annealing paths for inference and estimation
Robust and proactive error detection and correction in tables
Asset Metadata
Creator
Mehrabi, Ninareh
(author)
Core Title
Responsible artificial intelligence for a complex world
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2022-12
Publication Date
12/13/2022
Defense Date
12/13/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
Fairness, OAI-PMH Harvest, responsible artificial intelligence, robustness
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Galstyan, Aram (committee chair), Dehghani, Morteza (committee member), Lerman, Kristina (committee member), Morstatter, Fred (committee member), Nakano, Aiichiro (committee member)
Creator Email
mninareh@amazon.com,ninarehm@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112620750
Unique identifier
UC112620750
Identifier
etd-MehrabiNin-11363.pdf (filename)
Legacy Identifier
etd-MehrabiNin-11363
Document Type
Dissertation
Format
theses (aat)
Rights
Mehrabi, Ninareh
Internet Media Type
application/pdf
Type
texts
Source
20221214-usctheses-batch-996
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu