COUNTERING PROBLEMATIC CONTENT IN THE DIGITAL SPACE BIAS REDUCTION AND DYNAMIC CONTENT ADAPTATION by Ali Omrani A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY COMPUTER SCIENCE May 2024 Copyright 2024 Ali Omrani © Copyright by Ali Omrani 2024 All Rights Reserved 2 I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Morteza Dehghani) Principal Adviser I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Xiang Ren) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Robin Jia) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Payam Piray) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. (Jeffrey Sorensen) 3 Dedication To Fariba, deeply missed each day, yet her warmth and wisdom continue to light my path. 4 Table of Contents Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.1 Harms of Problematic Content Online . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.2 The Social Roots of Problematic Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.2.1 Stereotypical Biases of Our Society in Language Models . . . . . . . . . . . . . . . . . 17 1.2.2 Dynamic Nature of Problematic Content . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.2.3 Cultural Elements in Perceptions of Offensiveness . . . . . . . . . . . . . . . . . . . . 18 1.2.4 Human Annotator Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Chapter 2: Theory-Driven Bias Mitigation in Language Models . . . . . . . . . . . . . . . . 21 2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.1.1 Bias Mitigation in Word Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.1.2 Bias Mitigation in Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.1.3 The SCM and Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2 SCM-based Bias Mitigation for Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.1 Identifying a Bias Subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.2 Debiasing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.3 Static Word Embedding Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
25 2.2.4 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2.5 Results - Bias Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.2.6 Results - Word Embedding Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Mitigating Biases in Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.1 Results - Bias Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.2 Results - Natural Language Understanding . . . . . . . . . . . . . . . . . . . . . . . . 31 5 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Chapter 3: Embracing the Dynamic Nature of Problematic Content . . . . . . . . . . . . . 33 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.1.1 Problematic Content Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.1.2 Multitask Learning for Problematic Content . . . . . . . . . . . . . . . . . . . . . . . . 35 3.1.3 Continual Learning and Few Shot Generalization . . . . . . . . . . . . . . . . . . . . . 36 3.2 Continual Learning Benchmark for Problematic Content Detection . . . . . . . . . . . . . . . 36 3.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2.2 Training and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Models and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.2 Upstream Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.3 Downstream Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.4.1 Efficacy of The Continual Learning Framework . . . . . . . . . . . . . . . . . . . . . . 42 3.4.2 The Impact of Upstream Task Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.4.3 The Impact of Number of Shots in Downstream Adaptation . . . . . . . . . . . . . . . 46 3.4.4 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Chapter 4: Towards Multilingual Culturally Sensitive Models . . . . . . . . . . . . . . . . . 49 4.1 Building a Multilingual Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Multilingual Continual Learning Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3 Leveraging Cultural Diversity in Offensive Language Detection . . . . . . . . . . . . . . . . . 53 4.4 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.1 Cross lingual Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.2 Culture, Language, and Offensiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.5 Experimental Setup . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.6 English Data Does Not Always Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.7 Cultural Diversity Improves Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Chapter 5: Understanding the Negative Effects of Noisy Labels . . . . . . . . . . . . . . . . 59 5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.1.1 Sources of Label Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6 5.1.2 Robustness to Noisy Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.1.3 Parameter Efficient Tuning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.1 LoRA is Also Robust to Label Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2.2 Learning and Forgetting in LoRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.2.3 Robustness in the Face of Data Imbalance . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3 LoRA Almost Exclusively Learns from the Clean Data . . . . . . . . . . . . . . . . . . . . . . 64 5.4 Deeper Dive into Learning and Forgetting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.5 Increasing Model Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Chapter 6: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 Chapter A: Supplemental Material for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 72 A.1 Implementation Details for Static Word Embeddings. . . . . . . . . . . . . . . . . . . . . . . . 72 A.1.1 Training Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 A.1.2 Debiasing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 A.1.3 Bias Subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 A.1.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 A.1.5 Word Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 A.1.6 SCM-based Debiasing for More Social Groups . . . . . . . . . . . . . . . . . . . . . . . 73 A.2 Implementation Details for Debiasing Language Models . . . . . . . . . . . . . . . . . . . . . 74 A.2.1 SCM Words for Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A.2.2 Gender and Religion words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A.2.3 DPCE Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 A.2.4 ADEPT + SCM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Chapter B: Supplemental Material for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 77 B.1 Supplementary Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 B.1.1 Hardware and Runtimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 B.1.2 Data Sources, Statistics, and License Information . . . . . . . . . . . . 
. . . . . . . . . 77 B.1.3 Model Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 B.1.4 Detailed Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Chapter C: Supplemental Material for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . 127 C.1 Hardware and Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 C.2 Measuring WEIRDness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 C.3 Linguistic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 C.4 Detailed Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 7 Chapter D: Supplemental Material for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . 130 D.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 D.2 Detailed Results for Robustness to Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 8 List of Tables 2.1 ECT and EQT for gender, race, and age. Subscript “same” denotes the debiasing was performed with respect to the corresponding dimension (e.g. PPsame denotes PP was applied to gender for ECTgender.) and subscript “SCM” refers to debiasing with respect to the SCM subspace. Debiasing was repeated 30 times for each method, and bold values indicate higher scores (per method) with non-overlapping 95% confidence intervals. HD was limited to gender because of other dimensions’ lack of equality sets. . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.2 Embedding utility for debiased models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3 Summary of Bias Mitigation Algorithms used for Language Models . . . . . . . . . . . . . . . 28 2.4 SEAT effect sizes for selected tests on Race (S3, S4, S5), Gender (S6, S7, S8), and Religion (SR1, SR2). *: p¡ 0.01 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.5 StereoSet and CrowS-pairs benchmarks on BERTLARGE, and debiased models. We debiased the original model by DPCE and ADEPT with three different sets of attributes, Gender, Religion, and SCM. The best score on each algorithm based on the significant level on ten runs is in bold. → 50 means the ideal model score is 50. . . . . . . . . . . . . . . . . . . . . . 30 2.6 GLUE benchmark for Language Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1 AUC scores for chronological experiment. ∆ values are calculated in comparison to the corresponding Vanilla model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Results in AUC for experiments with alternative upstream task order. Rows marked with “Chronological” show the results of experiments with chronologically ordered datasets but shuffled task orders within a dataset. Rows marked with “Random Order” show the results on complete random order of upstream tasks. The ∆ values are computed in comparison to Adapter-Vanilla in each experiment. Notably, BiHNet+Reg demonstrates very stable performance regardless of the upstream task order. . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Source, task, statistics, and reference of datasets for multilingual benchmark. . . . . . . . . . 
52 4.2 Instant and final AUC for the multilingual benchmark with random order. . . . . . . . . . . . 52 A.1 ECT for each group when the model is debiased along the same dimension (PPsame, i.e. debiasing on gender for gender) and using SCM (PPSCM) . . . . . . . . . . . . . . . . . . . . 73 A.2 StereoSet benchmark for DPCE with different sample sizes.. . . . . . . . . . . . . . . . . . . . 75 9 B.1 Number of label occurrences in upstream tasks test sets. . . . . . . . . . . . . . . . . . . . . 78 B.2 Number of label occurrences in downstream tasks test sets. . . . . . . . . . . . . . . . . . . . 79 B.3 Number of datasets by source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 B.4 License information for all datasets used in the benchmark. According to this information, all datasets can be used for research purposes . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 B.5 P values to pairwise T-test between the fewshot performances for experiments with the chronological order of upstream tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 B.6 Final and instant AUC and F1 scores for upstream tasks for the chronological experiment . . 87 B.7 Instant and final AUC and F1 scores for upstream tasks for the random order experiment . . 93 B.8 AUC and F1 scores for few-shot downstream tasks for the chronological experiment . . . . . 106 B.9 AUC and F1 scores for few-shot downstream tasks for the random order experiment . . . . . 126 D.1 F1 scores of LoRA and fine-tuning on balanced IMDB dataset for various degrees of noise. . . 130 10 List of Figures 2.1 Group-specific debiasing vs. our proposed group-agnostic approach. Rather than iteratively debiasing with respect to each social attribute (e.g., gender or race), embeddings or language models are debiased with respect to warmth and competence, the two dimensions of the Stereotype Content Model (SCM). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 Current static approaches (I) train and evaluate models on a fixed set of datasets. Our benchmark embraces the dynamic aspects of problematic content detection in two stages. The upstream trainig (II) and evaluation (III) where data is assumed to be coming in a stream, and downstream fewshot evaluation (IV) that measure models’ generalization to novel forms of problematic content. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Sequence of upstream tasks in the experiment with chronological task order. Note that datasets are ordered according to the earliest publication date of the data and tasks (i.e., labels) within each dataset are ordered randomly. . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Shuffled sequence of tasks for the chronological experiment. . . . . . . . . . . . . . . . . . . . 44 3.4 Random sequence of upstream tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.5 Few-shot performance (AUC) based on number of shots (K) . . . . . . . . . . . . . . . . . . . 46 4.1 The relationship between cultural differences, as measured by the WEIRDness distance between La and Lt, and ∆(La, Lt). The regression line, derived from Equation 4.1, indicates that the WEIREDness distance predicts CLTL performance gains (β = 0.3, p = 0.02). . . . . 54 4.2 Distribution of ∆(La → Lt) by auxiliary language La. Most languages exhibit both positive and negative impacts on CLTL, underscoring the significance of considering cultural factors when choosing La. . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.1 Comparison of learning dynamics for LoRA (left) and fine-tuning (right) on a balanced subset of the IMDB dataset. Both LoRA and fine-tuning exhibit robustness to noise, achieving high validation performances. However, LoRA demonstrates a distinctive resistance to overfitting the noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2 Percentage of clean samples correctly classified by LoRA. LoRA demonstrates a consistent ability to learn almost exclusively from the clean samples. . . . . . . . . . . . . . . . . . . . 62 11 5.3 Number of forgettable data points for LoRA (blue) and fine-tuning (orange). LoRA consistently forgets fewer data points on the validation set. . . . . . . . . . . . . . . . . . . . . . . . 63 5.4 The best validation performance degradation happens for lower values of noise as imbalance worsens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.5 Percentage of clean samples correctly classified by fine-tuning. . . . . . . . . . . . . . . . . . 65 5.6 Learning dynamics for LoRA applied to RoBERTa-large on a balanced subset of the IMDB. 66 5.7 Comparison of the accuracy on clean (left) and noisy (right) samples in the training set for LoRA applied to RoBERTa-large on balanced IMDB dataset. . . . . . . . . . . . . . . . . . 67 5.8 Comparison of the accuracy on learning (right) and forgetting (left) for LoRA applied to RoBERTa-large on balanced IMDB dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.9 Comparison of the accuracy on noisy samples in the training set for LoRA (left) and finetuning (right) on balanced IMDB dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.10 Comparison of learning events for LoRA (left) and fine-tuning (right) on balanced IMDB dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.11 Comparison of forgetting events for LoRA (left) and fine-tuning (right) on balanced IMDB dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.1 t-SNE plot after running ADEPT based on SCM. Some pairwise words, mostly in the warmth dimension (red/yellow) cluster together, but others, mostly in the competence dimension (green/cyan) do not. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 C.1 The CLTL performance change between 105 language pairs. . . . . . . . . . . . . . . . . . . 129 12 Abstract Problematic content, such as hate speech, harassment, and negative stereotypes, poses a daunting challenge that can give rise to discrimination, prejudice, and exclusion, ultimately eroding inclusivity and psychological well-being [23]. Both academic researchers and industry leaders have recognized the adverse impacts of problematic content and have developed various strategies and resources to understand and tackle them [148, 161]. Nevertheless, a significant portion of these endeavors neglects to recognize that problematic content fundamentally stems from societal and psychological issues [77]. In this thesis, I argue that computational solutions for countering problematic content can be significantly improved by leveraging insights from social and cognitive psychology. 
I begin by discussing the social roots of problematic content and advocate for a strong partnership between social and computer scientists. This collaboration is crucial in developing effective strategies to address the challenges posed by the widespread presence of problematic content [140]. Throughout my thesis, I demonstrate the practical application of this interdisciplinary approach by studying four key aspects of problematic content and creating adaptable solutions for combating such content in digital space using theory-motivated approaches that bridge language technology and social sciences. Firstly, I will introduce my theory-driven, social-group-agnostic framework for mitigating biases in language models [138]. With the increasing prevalence of Large Language Models (LLMs), there is a pressing need to acknowledge the potential risks associated with them generating stereotypical and problematic content. To ensure that these models do not further perpetuate the existing disparities in our society, it is crucial that our efforts to mitigate bias are all-encompassing [17]. However, most existing bias mitigation approaches are social-group specific, which restricts their applicability to less-researched social groups. I propose that Stereotype Content Model (SCM) — a theoretical framework developed in social psychology for understanding the content of stereotyping — can help debiasing efforts to become social-group-agnostic by capturing the underlying connection between bias and stereotypes. I perform debiasing with established methods on both pre-trained word embeddings and large language models and demonstrate that my socialgroup-agnostic, SCM-based debiasing technique performs comparably to group-specific debiasing on multiple bias benchmarks, but has theoretical and practical advantages over existing approaches. Secondly, I propose a novel approach aimed at capturing the dynamic nature of problematic online content [139] aiming to overcome current static approaches’ limitations. For instance, since such approaches often demand large annotated training data, researchers are frequently ill-prepared to respond to emerging cases 13 promptly [197]. In response, I describe my novel continual learning framework and benchmark comprising two essential elements: initially, a series of upstream tasks designed to mimic the evolution of problematic content and gauge a model’s capacity for knowledge acquisition; and subsequently, a distinct collection of few-shot downstream tasks intended to assess a model’s agility in learning novel types of problematic content. I provide empirical evidence demonstrating the effectiveness of this approach. Specifically, I show that continual learning methods outperform static approaches by up to 17% in capturing the evolving content and by up to 4% AUC in adapting to novel forms of problematic content. Thirdly, I emphasize the limitations of English-centric approaches for multilingual settings and show the empirical importance of cultural factors in offensive language detection. This chapter begins by extending my continual learning benchmark to 15 additional languages. Establishing the multilingual baselines on this benchmark revealed frequent negative transfer between tasks. To understand the underlying reasons for this observation, I investigated the influence of cultural and linguistic factors on transfer learning across 105 language pairs, and uncovered several key findings. 
Firstly, training exclusively on English data impedes performance in certain target languages. Secondly, linguistic proximity between languages does not have a significant impact on transferability. Lastly, there is a significant correlation between cultural distance and performance. Importantly, for each unit increase of cultural distance, there was an increase of 0.3 in the AUC. Lastly, I study the impact of noisy labels present in problematic content datasets on the effectiveness of parameter-efficient solutions in detecting such content. Problematic content datasets, often collected through crowd-sourcing, are susceptible to label inaccuracies. Furthermore, with the ongoing expansion of language model sizes, techniques like Low-Rank Adaptation (LoRA), provide computational efficiency while preserving performance. However, their robustness to label noise remains unexplored. I investigate whether LoRA-tuned models demonstrate the same level of noise resistance observed in fully fine-tuned Transformer models. My investigation has multiple key findings: First, I show that LoRA exhibits robustness to random noise similar to full fine-tuning on balanced data, but unlike full fine-tuning, LoRA does not overfit the noisy data. Second, I observe that compared to full fine-tuning, LoRA forgets significantly fewer data points as noise increases. Third, studying how these robustness patterns change as training data becomes imbalanced, I observe that Transformers struggle with imbalanced data, with robustness declining as imbalance worsens. Overall, this thesis underscores the importance of incorporating social psychology in computational methods to tackle problematic content effectively. By bridging disciplines, we can develop more robust strategies for combating biases in language models, understanding the dynamic nature of online content, creating culturally sensitive models, and dealing with noisy labels. . 14 Chapter 1: Introduction 1.1 Harms of Problematic Content Online Problematic content online, encompassing hate speech, harassment, and negative stereotypes, poses a multifaceted threat to both individual well-being and the health of society. For one, it can have a profound impact on mental health. Exposure to a constant barrage of negativity, prejudice, and hostility can be emotionally draining, leading to anxiety, depression, and feelings of isolation [160]. This is particularly concerning for young people and those already vulnerable. Hate speech and harassment can target specific groups, creating a hostile online environment that excludes and belittles. Negative stereotypes can be internalized, leading to a decline in self-esteem and a sense of marginalization. In extreme cases, this negativity can spiral into self-harm or even violence [28]. The harms extend beyond the individual. Problematic content acts as a social pollutant, fueling societal divisions and hindering progress. By normalizing hate speech and stereotypes, it creates a climate of intolerance and distrust. When negativity goes unchecked, it can become normalized, emboldening those who hold prejudiced views. This can lead to real-world consequences, with online hate translating into discrimination and violence against targeted groups [180]. Additionally, the spread of stereotypes can make it harder to achieve social progress by reinforcing existing biases and hindering efforts to create a more inclusive society. The economic costs of problematic content should not be ignored. 
A hostile online environment can discourage participation and innovation. When negativity discourages people from expressing themselves freely, it stifles creativity and limits the potential of online communities [187]. Over the last two decades, two significant technological advancements, namely, the emergence of social media platforms and the development of Large Language Models (LLMs), have heightened the challenge of detecting problematic content, thrusting it into the forefront of public consciousness. Social media platforms offer individuals the means to freely express themselves. However, certain features of social media, such as partial anonymity, which may promote freedom of expression, can also result in the dissemination of problematic content [182]. Recent advancements have significantly improved the capabilities of LLMs, rendering them ubiquitous in our lives. Yet, alongside these advancements, LLMs have also shown the ability to learn and amplify problematic content, adding a new layer to this complex issue [132]. Recognizing the gravity of this issue, scholars and industry experts alike have developed various machine 15 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation learning approaches and tools to tackle it [148, 161]. However, while machine learning holds promise in identifying and addressing problematic content, the proposed solutions often overlook the deeper societal and psychological factors underlying the creation and dissemination of problematic content. In essence, addressing problematic content necessitates a multifaceted approach that considers not only technological advancements but also the intricate interplay of societal norms, individual behaviors, and psychological mechanisms. Only through a comprehensive understanding of these factors can effective solutions be devised to foster a safer and more inclusive digital environment. 1.2 The Social Roots of Problematic Content The proliferation of problematic content online transcends the realm of online social media platforms and language models. It represents a deeply rooted societal issue mirrored and exacerbated by the internet. To effectively study and mitigate this issue, it is essential to explore the human psyche, underscoring the importance of social and cognitive psychology in our endeavors to foster a healthier online environment [140]. Social psychology offers a powerful lens into how online communities can become breeding grounds for negativity. The anonymity and perceived distance from real-world consequences embolden users to engage in behaviors they might not offline. This fosters the creation of echo chambers, where like-minded individuals reinforce each other’s biases, demonizing out-groups and creating a ripe environment for the spread of hate speech and stereotypes [33]. Social psychology can inform interventions that bridge these divides. Encouraging interaction between diverse groups, for example, can foster empathy and understanding, chipping away at the foundations of prejudice. But social dynamics are just one piece of the puzzle. Cognitive psychology sheds light on how our own thinking processes contribute to the problem. Confirmation bias, for instance, steers us towards information that confirms our existing beliefs, no matter how hateful or demonstrably false [134]. This creates a selfreinforcing cycle where negativity and misinformation spiral out of control. Cognitive interventions provide a solution here. 
Techniques that prompt users to question the source of their information or highlight potential biases in their own thinking can equip them with critical evaluation tools [95]. By fostering a more discerning online experience, cognitive psychology can empower us to dismantle the echo chambers that amplify negativity. I argue that collaboration between computer scientists and social and cognitive psychologists is crucial in developing effective solutions to tackle problematic content detection, particularly because problematic content fundamentally stems from societal and psychological factors. By integrating insights from social and cognitive psychology into algorithmic approaches, interdisciplinary teams can better understand the complex interplay between technology, human behavior, and societal dynamics, leading to more comprehensive and impactful solutions for combating problematic content online. In this dissertation, by drawing on the rich literature of social psychology, I propose interdisciplinary solutions to four key challenges faced by technologies aiming at detecting and mitigating the adverse effects of problematic content online. 16 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation 1.2.1 Stereotypical Biases of Our Society in Language Models Language Models (LM) inherit the biases of the human societies they reflect, similar to how a student’s knowledge is shaped by their educational materials [25]. These biases become ingrained during training on massive textual corpora, which often encode societal stereotypes [111]. As a result, LMs can perpetuate biases in gender, race, profession, and other areas, potentially amplifying existing inequalities. Social and cognitive psychologists bring deep insights into human behavior, cognition, and social dynamics such as stereotyping. In response to this issue, various methods have been developed to mitigate the stereotypical biases in language models. However, most of these bias mitigation approaches rely on a group-specific strategy. These methods debias a language model with respect to social groups (gender, race, etc.) and suffer from two key limitations: • Lack of Theoretical Foundation: These approaches lack a strong theoretical grounding, hindering a deeper understanding of bias. As a result group-specific debiasing may not be entirely effective. Research suggests that stereotypic relations in language models are complex and may not be easily addressed through the simple removal of group-specific associations. • Scalability Issues: Group-specific methods require resources tailored to each targeted group. Developing and maintaining these resources can be impractical, especially for understudied social groups. Additionally, attempting to debias across all possible groups is practically infeasible and could potentially degrade model performance. The Stereotype Content Model (SCM) [53, 37] is a social psychological theory positing that stereotyping of different social groups can be captured along two orthogonal dimensions, “warmth” and “competence.” The warmth dimension of stereotypes has to do with people’s intentions in interpersonal interactions, while the competence dimension has to do with assessing others’ ability to act on those intentions. Warmth has been typically quantified using surveys and experiments by asking about people’s perceptions of social groups’ friendliness, pleasantness, and morality. 
On the other hand, competence has been quantified by asking about people’s capability, intelligence, and industriousness. While there are a number of other social psychological theories capturing outgroup biases [e.g., 204, 99], SCM has been shown to predict emotional and behavioral reactions to societal outgroups. Expanding upon this, in chapter 2, I delve into my proposed theory-driven approach based on SCM for social-group-agnostic bias mitigation. 1.2.2 Dynamic Nature of Problematic Content Human societies are inherently dynamic, constantly evolving and adapting to new environments and challenges. Our remarkable ability to adapt to varying social contexts, particularly during hardships like war or pandemics, can be understood through several key social psychological concepts. Firstly, social norms, the unspoken rules that govern behavior within a group, can shift dramatically in response to external threats 17 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation [141]. During a war, cooperation and self-sacrifice become essential, leading to a new norm of prioritizing collective well-being over individual needs. However, when individuals prioritize the safety and welfare of their own group above all else, they may view members of the opposing group as less valuable. This mindset can fuel prejudice, discrimination, and ultimately, hatred toward the outgroup. For instance, in times of conflict, propaganda often portrays the enemy in a negative light, emphasizing their differences and framing them as a threat to the collective well-being of one’s own group. Secondly, social learning theory explains how we observe and adopt behaviors from others [11]. Witnessing acts of courage and resilience during hardship can inspire and motivate others to adapt and contribute positively. However, In the context of war, implications of this theory can have negative ramifications. For example, when individuals witness acts of aggression or prejudice towards the outgroup, they may internalize these behaviors as acceptable or even necessary [32]. Finally, social identity theory highlights our desire to belong and maintain a positive self-image within our social groups [173]. However, individuals may exploit group identities for personal gain or to justify harmful actions. For example, in times of economic instability, certain individuals might manipulate group norms to scapegoat marginalized communities, exacerbating social divisions and undermining collective efforts to address underlying issues [22]. This same dynamic extends to the ways we interact with each other online, and expressions of negativity, such as hate speech, prejudice, and incivility online, also adapt and transform alongside the social fabric they inhabit. In chapter 3, I argue that neglecting this dynamic nature has led to the development of automated systems designed with a major limitation in the current approaches and strategies for detecting problematic content: they adopt a static perspective, primarily focusing on problematic content in English. While the status quo approach has proven effective in curbing the spread of problematic content in the U.S., it is important to acknowledge the inherent limitations associated with the static view. This static approach has led to heterogeneity among both the methods and the resources [197]. 
For instance, since such approaches often require large annotated training data, researchers are frequently ill-prepared to promptly respond to emerging cases, such as the upsurge in new forms of xenophobia during the COVID pandemic. This heterogeneity has also resulted in inconsistencies in reports regarding progress in the automatic detection of problematic content, complicating the comparison of competing methods and hindering the identification of effective strategies for advancement. Acknowledging the dynamic nature of problematic content, in chapter 3, I describe how I designed and developed a benchmark and framework based on continual learning that integrates various resources across different platforms and timelines capturing the dynamic nature of the issue of problematic content detection. 1.2.3 Cultural Elements in Perceptions of Offensiveness Social psychology also delves into the interplay between individuals and their social environments, exploring how our thoughts, feelings, and behaviors are shaped by the groups we belong to and the broader societal context [12]. Through the lens of social psychology, we can appreciate how the very definition of “problematic content” can be fluid and subjective, varying across different cultures, contexts, and even within the same 18 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation society over time [171]. Culture encompasses a spectrum of pragmatic solutions devised by societies to tackle survival challenges [142], often operationalized as distributed patterns of mental representations among populations [10]. These cultural solutions manifest in various beliefs, values, norms, and practices [20]. One dimension of cultural divergence is the dichotomy between individualism and collectivism [179]. Individualistic cultures prioritize autonomy, distinctiveness, and the pursuit of uniqueness. Conversely, collectivistic cultures emphasize unity, conformity, communal harmony, and shared responsibility [143, 120]. A significant area where these cultural orientations diverge is in perceptions of offense, including the nature of offenses, the intensity of emotional reactions they evoke, and perspectives on appropriate retribution [117]. Collectivist cultures view offenses against communal entities, such as national symbols, religious beliefs, or family honor, as severe threats to social cohesion [96]. Conversely, individualistic cultures weigh offenses against personal achievements, professional reputation [68], or individual identity (e.g., gender or sexual orientation) with equal seriousness. While the individualism vs. collectivism framework offers insights into cultural psychology concerning offense, it overlooks other dimensions of cultural distinctions, such as a society’s tolerance for norm violations, referred to as the tightness-looseness dimension [59], which shapes perceptions of offensive language. Recent studies have neglected the importance of cultural context and instead promoted a universal approach, employing English data to improve the effectiveness of offensive language classifiers in languages with limited resources [156]. As a result, as shown in recent research by Lee et al. [102], hate speech classifiers lack cultural sensitivity. Moreover, challenges described in subsection 1.2.2 become even more pronounced due to resource limitations in non-English languages. 
This is primarily because, within this static framework, transferring knowledge from one language to another poses a non-trivial challenge. As a result, even tech companies, such as Meta, have language blind spots in their problematic content detection [34]. In chapter 4, I systematically investigate the influence of linguistic and cultural similarities on the cross-lingual transferability of hate speech and offensive language detection and show that cultural diversity among datasets improves the positive transfer in offensive language detection among them. 1.2.4 Human Annotator Errors The success of most solutions developed for automatically detecting problematic content heavily hinges on the availability of annotated datasets [148]. These datasets operate under the assumption that by consolidating the knowledge of a crowd of human annotators, we can derive reliable estimates of the true labels. However, human cognition and decision-making processes are inherently imperfect and can result in noisy labels. Cognitive psychology provides insights into the myriad of factors contributing to label noise, ranging from insufficient evidence to perceptual errors [122]. Fr´enay and Verleysen [57] categorize potential sources of label noise into four distinct categories, shedding light on the complexity of the problem. Firstly, label noise often emerges when the information provided to annotators lacks sufficient detail, rendering the resulting labels unreliable. For instance, accurately labeling a post for offensive language may prove challenging without the broader context of the conversation [194]. Rottger et al. [158] also note that 19 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation the annotation manual’s lack of elaboration or prescription can contribute to unreliable labeling. Secondly, errors can stem from the involvement of non-experts, typically hired through crowdsourcing platforms to mitigate annotation costs. These annotators may lack the expertise necessary to make accurate judgments, introducing inconsistencies into the labeled data. Thirdly, tasks such as offensive language detection inherently involve subjectivity, with no single ground truth, leading to significant variation in labels assigned by individual annotators. This subjectivity underscores the challenges in achieving consensus, as different annotators may interpret the same content differently. Lastly, label noise may occur due to data encoding issues, where posts are mistakenly flagged as offensive due to accidental clicks or other technical glitches. These encoding errors further exacerbate the challenge of ensuring high-quality labeled datasets. In chapter 5, I investigate the impact of label noise on the state-of-the-art methods for problematic content detection. 20 Chapter 2: Theory-Driven Bias Mitigation in Language Models The societal impacts of Natural Language Processing (NLP) have stimulated research on measuring and mitigating the unintended social-group biases encoded in language models [76]. However, the majority of this important line of work is atheoretical in nature and “fails to engage critically with what constitutes ‘bias’ in the first place” [16]. The bias found in language models is rooted in human biases [25]; thus, to alleviate such biases, we should ground our debiasing approaches in social psychological theories of stereotyping. 
These theories can help us shed light on the underlying structure of language-embedded biases rather than attending to ad hoc superficial patterns [140]. Although there is a multitude of approaches to bias mitigation [18, 201, 46, 86, 168], most of these approaches are group-specific. Such methods, which debias along subspaces defined by social groups or attributes (e.g., gender or race), are not only atheoretical but also unscalable. Resources developed for bias mitigation on one social group or attribute (e.g., gender) do not axiomatically translate easily into other groups or attributes (e.g., age). For example, previous works' focus on gender bias has driven the development of resources that are only applicable to gender debiasing (e.g., equality word sets for gender), but biases associated with other social groups and/or attributes remain understudied. Beyond the challenge of creating such resources for a given attribute, to achieve an "unbiased" model with group-specific debiasing, one would have to iterate over all social groups. This approach is practically impossible and would arguably result in significant degradation of the model's expressiveness. Furthermore, group-specific debiasing is limited in terms of effectiveness: stereotypic relations in distributed representations are deep-rooted, and thus may not be easily removed using explicit sets of group-specific words [3, 64]. In contrast, a social-group-agnostic approach would not have such restrictions. In Omrani et al. [138], I proposed that by relying on a theoretical understanding of social stereotypes to define the bias subspace (rather than group-specific subspaces), bias in pre-trained word embeddings and large language models can be adequately mitigated across multiple social attributes (Figure 2.1).

Figure 2.1: Group-specific debiasing vs. our proposed group-agnostic approach. Rather than iteratively debiasing with respect to each social attribute (e.g., gender or race), embeddings or language models are debiased with respect to warmth and competence, the two dimensions of the Stereotype Content Model (SCM).

The Stereotype Content Model [SCM; 53] is a theoretical framework developed in social psychology to understand the content and function of stereotypes in interpersonal and intergroup interactions. The SCM proposes that human stereotypes are captured by two primary dimensions of social perception: warmth (e.g., trustworthiness, friendliness) and competence (e.g., capability, assertiveness). From a socio-functional, pragmatic perspective, people's perception of others' intent (i.e., warmth) and capability to act upon their intentions (i.e., competence) affect their subsequent emotion and behavior [37]. Warmth divides people (or groups of people under a social identity) into "friends" or "foes," while competence contains perceptions of social groups' status.
Depending on historical processes, various social groups may be located in different stereotypic quadrants (high vs. low on warmth and competence) based on this two-dimensional model [31]. Specifically, I confirm that by debiasing with respect to the subspace defined by warmth and competence, our SCM-based approach performs comparably with group-specific debiasing for a given group (e.g., SCM-based debiasing is comparable to race-debiasing on race, see §2.2.5). I also show that SCM-based debiasing simultaneously reduces bias for understudied attributes such as religion and age (§2.2.5 and §2.3.1). Finally, I show that for both word embeddings and large language models, SCM-based debiasing retains the original model's expressiveness (§2.2.6 and §2.3.1). Overall, my results confirm the viability of a theory-based, social-group-agnostic approach to bias mitigation.

2.1 Background

2.1.1 Bias Mitigation in Word Embedding

Part of our work builds on post hoc bias mitigation, which aims to remove biases by modifying pre-trained word embeddings and language models. Most efforts we review are restricted to gender-related debiasing [e.g., 18, 201, 46]; importantly, we focus our work on other social categories as well, bringing attention to these understudied groups and attributes.

Originally, Bolukbasi et al. [18] proposed Hard Debiasing (HD) for gender bias. HD removes the gender component from inherently non-gendered words and enforces an equidistance property for inherently gendered word pairs (equality sets). Two follow-ups to this work include: Manzini et al. [118], which formulated a multiclass version of HD for attributes such as race; and Dev and Phillips [46], which introduced Partial Projection (PP), a method that does not require equality sets and is more effective than HD in reducing bias. Extending these approaches to other social attributes is not trivial because a set of definitional word pairs has to be curated for each social group/attribute; this curation is a non-trivial task, as the list of words required is dynamic and context-dependent.

Gonen and Goldberg [64] demonstrated that gender bias in word embeddings is deeper than previously thought, and that methods based on projecting words onto a "gender dimension" only hide bias superficially. They showed that after debiasing, most words maintain their relative position in the debiased subspace. Our work is motivated by this important criticism. Specifically, we argue that our theory-driven approach removes the actual psychological 'bias' subspace, rather than subspaces that, often only superficially, contain bias for specific groups.

2.1.2 Bias Mitigation in Language Models

Fine-tuning and prompt-tuning are two major paradigms for using pre-trained language models in NLP, and bias mitigation approaches have been proposed based on both. Kaneko and Bollegala [87] introduced Debiasing Pre-trained Contextualized Embeddings (DPCE), a fine-tuning method for bias mitigation in language models with a loss term that aims to find a balance between bias mitigation and language modeling ability. Other fine-tuning methods include AutoDebias [69], which samples the language model to find examples for fine-tuning-based debiasing. Prompt-tuning [107, 112, 104] is mainly done through either discrete prompts, which consist of text (i.e., tokens), or continuous prompts, which consist of a continuous array of numbers prefixed to a language model and trained.
It has been shown that bias in language models can be mitigated by providing descriptions of desired and undesired behavior as discrete prompts [162, 6, 168]. More recently, Yang et al. [196] proposed "A DEbiasing PrompT" (ADEPT), which outperforms DPCE by using continuous prompt tuning to mitigate biases in a language model.

2.1.3 The SCM and Language

SCM is a well-established theoretical framework of stereotyping and has begun to be applied in NLP. Recently, Nicolas et al. [135] developed dictionaries to measure warmth and competence in textual data. Each dictionary was initialized with a set of seed words from the literature, which was further expanded using WordNet [127] to increase the coverage of stereotypes collected from a sample of Americans. Cao et al. [26] employ Agency-Belief-Communion [ABC; 99], an alternative theory from social psychology for stereotype content, to discover stereotyping in language models. Fraser et al. [56] demonstrated that, in word embeddings, SCM dictionaries capture the group stereotypes documented in social psychological research. Recently, Mostafazadeh Davani et al. [128] applied SCM dictionaries to quantify social group stereotypes embedded in language, demonstrating that patterns of prediction biases can be explained using social groups' warmth and competence embedded in language. Contemporary to our work, Ungless et al. [181] explore the idea of using SCM for bias mitigation but fall short of evaluating their models on the necessary benchmarks.

2.2 SCM-based Bias Mitigation for Word Embeddings

Before we discuss our proposed method, we briefly review the algorithms and benchmarks on bias mitigation in word embeddings. There are two components to each post hoc bias mitigation approach for static word embeddings: the Bias Subspace, which determines the subspace over which the algorithms operate, and the Algorithm, which is how the word embeddings are modified with respect to the bias subspace. In this section, we review the concept of bias subspaces, established algorithms for debiasing, and how bias is quantified in word embeddings. Finally, we introduce our social-group-agnostic framework: SCM-based debiasing.

2.2.1 Identifying a Bias Subspace

Post hoc word embedding debiasing algorithms operate over a subspace of bias in the embedding space. Given a set $D = \{(d_1^+, d_1^-), \dots, (d_n^+, d_n^-)\}$ of word pairs that define the bias concept (e.g., "father"–"mother" for binary gender), the bias subspace $v_B$ is given by the first $k$ principal components of the matrix $C$ constructed by stacking the difference vectors $d_i^+ - d_i^-$.

2.2.2 Debiasing Algorithms

Method definitions below use the following notation: $W$ denotes the vocabulary, and $w$ and $w'$ denote the embedding of word $w$ before and after debiasing.

Hard Debiasing (HD) An established approach for mitigating bias in word embeddings is Hard Debiasing [HD; 18]. For gender, HD removes the gender subspace from words that are not inherently gendered by projecting them orthogonal to the gender subspace. For word pairs that are inherently gendered, HD equalizes them, modifying the embeddings such that they are equidistant from the inherently non-gendered words.

Subtraction (Sub) Sub was introduced as a baseline by Dev and Phillips [46], wherein the bias subspace $v_B$ is subtracted from all word vectors.
Formally, for all $w \in W$, $w' := w - v_B$.

Linear Projection (LP) To mitigate the bias with respect to a bias direction $v_B$, Linear Projection (LP) projects every word $w \in W$ to be orthogonal to $v_B$. Formally, $w' := w - \pi_B(w)$, where $\pi_B(w) = \langle w, v_B \rangle v_B$ is the projection of $w$ onto $v_B$.

Partial Projection (PP) To improve on LP, Partial Projection (PP) was developed to allow the extent of projection to vary based on the component of the given word vector which is orthogonal to the bias subspace. Intuitively, only words with unintended bias (e.g., "nurse" or "doctor"), and not words which are definitional to the bias concept (e.g., "man" or "woman"), will have a large orthogonal component to the bias subspace $v_B$. For all words $w \in W$,
$$w' = \mu + r(w) + \beta \cdot f(\|r(w)\|) \cdot v_B, \qquad \beta = \langle w, v_B \rangle - \langle \mu, v_B \rangle,$$
where $\mu$ is the mean embedding of the words used to define $v_B$, $r(w) = w - \langle w, v_B \rangle v_B$ is the bias-orthogonal component, and $f(\cdot)$ is a smoothing function which helps to remove unintended bias and keep definitional bias. We use $f(\eta) = \frac{1}{(\eta + 1)^2}$ [see 46].

2.2.3 Static Word Embedding Benchmarks

Embedding Coherence Test Given a set of tuples $A = \{(a^1_1, a^2_1, \dots), (a^1_2, a^2_2, \dots), \dots\}$, where $a^i_j$ denotes the $j$-th word for the $i$-th subgroup of an attribute (e.g., $\{(\text{"father"}, \text{"mother"}), \dots\}$ for binary gender), and a set of professions $P = \{p_1, \dots, p_m\}$, the Embedding Coherence Test [ECT; 46] is the Spearman rank correlation between the rank orders of the professions' cosine similarities with each subgroup's average embedding. Bias is completely removed when subgroups have identical orderings of associations with professions (ECT = 1).

Embedding Quality Test Word analogies are one of the main methods for evaluating word embeddings [124]. The EQT [46] quantifies the improvement in unbiased analogy generation after debiasing. Similar to ECT, EQT requires a set of word pairs $A$ and a set of professions $P$. For each word pair $(a_i^+, a_i^-)$, the analogy $a_i^+ : a_i^- :: p_j : x$ is completed; if the answer $x$ is $p_j$ or a plural or synonym of $p_j$ (via NLTK; [15]), it is counted as unbiased. EQT is the ratio of unbiased analogies to all analogies. An ideal unbiased model would achieve EQT = 1, while lower values indicate a more biased model.

2.2.4 Proposed Method

To identify a group-agnostic bias subspace, we use the warmth and competence dictionaries from [135]. To construct the poles of the dimensions, "high" and "low" word pairs (e.g., "able"–"unable" for competence and "sociable"–"unsociable" for warmth) were selected by down-sampling to 15 word pairs per dimension. We use the word pairs for each SCM dimension to identify an SCM subspace (see Section 2.2.1), and subsequently apply the methods from Section 2.2.2.

We test whether SCM-based debiasing can substitute for group-specific debiasing simultaneously for gender, race, and age. This is broken down into two related research questions. First, does SCM-based debiasing remove a comparable amount of bias relative to group-specific debiasing? And second, does SCM-based debiasing have more or less of a negative effect on embedding utility [18]? We compare SCM-based debiasing to group-specific debiasing using previous debiasing methods, specifically HD, Sub, LP, and PP (Section 2.2.2), and evaluate bias as measured by ECT and EQT, following Dev and Phillips [46]. In addition, we evaluate the performance of each set of debiased embeddings on established word embedding benchmarks [82].
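To make the procedure in Sections 2.2.1–2.2.4 concrete, the sketch below derives a bias direction from definitional word pairs and applies Partial Projection to every word. It is a minimal illustration rather than the experimental code used in this chapter: the embedding dictionary, the helper names, the toy SCM pairs, and the restriction to a one-dimensional subspace ($k = 1$) are assumptions made for brevity.

```python
import numpy as np

def bias_subspace(emb, pairs, k=1):
    """First k principal components of the stacked difference vectors (Section 2.2.1)."""
    C = np.stack([emb[hi] - emb[lo] for hi, lo in pairs])
    C = C - C.mean(axis=0)                            # center before PCA
    _, _, Vt = np.linalg.svd(C, full_matrices=False)  # rows of Vt are principal directions
    return Vt[:k]

def partial_projection(emb, pairs):
    """Partial Projection (PP) with the smoothing f(eta) = 1 / (eta + 1)^2 (Section 2.2.2)."""
    v_b = bias_subspace(emb, pairs, k=1)[0]
    mu = np.mean([emb[w] for pair in pairs for w in pair], axis=0)
    debiased = {}
    for word, vec in emb.items():
        beta = vec @ v_b - mu @ v_b                   # bias component relative to the pair mean
        r = vec - (vec @ v_b) * v_b                   # bias-orthogonal component r(w)
        f = 1.0 / (np.linalg.norm(r) + 1.0) ** 2      # smoothing keeps definitional bias
        debiased[word] = mu + r + beta * f * v_b
    return debiased

# Illustrative (not the down-sampled experimental) SCM word pairs: warmth and competence poles.
scm_pairs = [("sociable", "unsociable"), ("genuine", "fake"),
             ("able", "unable"), ("smart", "stupid")]
# group_pairs = [("father", "mother"), ...]  # a group-specific baseline uses lists like this
```

Running `partial_projection` with a group-specific pair list corresponds to the "same" baselines below, while substituting warmth and competence pairs (applied per SCM dimension) gives the group-agnostic SCM variant evaluated in Section 2.2.5.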
2.2.5 Results - Bias Reduction

            Vanilla  HDsame  Subsame  SubSCM  LPsame  LPSCM  PPsame  PPSCM  PPG+R+A
ECTgender    0.83     0.92    0.83     0.83    0.82    0.83   0.99    0.97   0.99
ECTrace      0.69     -       0.51     0.52    0.70    0.74   0.99    0.96   0.99
ECTage       0.30     -       0.23     0.34    0.60    0.34   0.96    0.95   0.99
EQTgender    0.075    0.056   0.071    0.072   0.081   0.073  0.063   0.049  0.059
EQTrace      0.042    -       0.032    0.036   0.051   0.044  0.061   0.056  0.073
EQTage       0.052    -       0.043    0.041   0.062   0.051  0.063   0.047  0.057

Table 2.1: ECT and EQT for gender, race, and age. Subscript “same” denotes that debiasing was performed with respect to the corresponding dimension (e.g., PPsame denotes PP applied to gender for ECTgender), and subscript “SCM” refers to debiasing with respect to the SCM subspace. Debiasing was repeated 30 times for each method, and bold values indicate higher scores (per method) with non-overlapping 95% confidence intervals. HD was limited to gender because the other dimensions lack equality sets.

               Analogy ↑         Similarity ↑
               Google   MSR      WS353   RG-65
Vanilla        0.39     0.45     0.50    0.50
PPGender       0.31     0.36     0.49    0.37
PPGender+Race  0.27     0.33     0.43    0.30
PPG+R+A        0.25     0.30     0.40    0.27
PPSCM          0.29     0.34     0.42    0.33

Table 2.2: Embedding utility for debiased models.

We investigate whether SCM-based debiasing can simultaneously debias word embeddings with respect to gender, race, and age. For a given bias dimension, we established baselines by applying HD, Sub, LP, and PP using the respective word pair list (e.g., for gender bias we used gender word pairs), denoted with the subscript “same.” To place an upper bound on removed bias, we perform PP using gender, race, and age word lists (PPG+R+A). For race and age we used the lists from Caliskan et al. [24], while gender lists were taken from Bolukbasi et al. [18]. All methods were repeatedly applied using 30 different word pair samples, and we report each measure’s average and compare values using 95% confidence intervals. Implementation details are provided in the Appendix.

Table 2.1 shows the results of our experiments. Overall, SCM-based debiasing performs comparably to social-group-specific debiasing across methods. Specifically for ECT, SCM-based debiasing was either better than, or not statistically different from, LPsame and Subsame, while SCM-based debiasing was only slightly out-performed by PPsame (0.01-0.03). In other words, these results demonstrate that the warmth and competence dimensions can simultaneously capture gender, race, and age bias in word embeddings. For the EQT, results are somewhat similar to those of ECT; however, we caution against interpreting small differences in EQT because its definition of biased analogies relies on NLTK to compile comprehensive sets of synonyms and plural forms of words [46].

2.2.6 Results - Word Embedding Utility

Table 2.1 shows that PPG+R+A outperformed all other methods on bias evaluations. However, one tradeoff is the reduction in word embedding utility. Table 2.2 shows that PPSCM preserves more embedding utility than PPG+R+A, using established benchmarks for analogy and similarity [82]. Because of the information removed in the debiasing process, the quality of embeddings for group-specific debiasing deteriorates as the number of social attributes increases. This is not the case for PPSCM, indicating that PPSCM preserves some of the definitional biases (e.g., the gender bias of actor vs. actress).
These results indicate that our proposed approach for SCM-based bias mitigation is a better solution, especially when our goal is to remove social biases for as many groups as possible.

2.3 Mitigating Biases in Language Models

Methods of post-hoc bias mitigation: Post-hoc bias mitigation in language models requires a definition of the bias subspace. The bias subspace is defined via a set of attribute word tuples A = {(a_1^1, a_1^2, ...), (a_2^1, a_2^2, ...), ...}, where a_j^i denotes the jth word for the ith subgroup of an attribute (e.g., for religion {(“Muslim”, “Jewish”, “Christian”), ...}). In addition, bias mitigation for language models requires a set of neutral target words T (e.g., occupations such as “doctor”, “nurse”, etc.). Given a language model M_θ, the goal is to find M_θ′ such that the difference in association of each neutral target word with all subgroups is minimized. Our goal is to provide a theory-driven framework for bias mitigation that generalizes to many social groups. Therefore, we chose to focus on top-performing post-hoc methods that do not require significant hand-designed resource development as our baselines. Both bias mitigation methods we used, namely “Debiasing Pretrained Contextual Embeddings” [DPCE; 87] and “A DEbiasing PrompT Framework” [ADEPT; 196], view bias mitigation as a downstream task and use a loss L = L_bias + L_representation that balances bias mitigation (via L_bias) with preserving a model’s representational power (via L_representation). The key difference between these approaches is that DPCE minimizes this loss via fine-tuning while ADEPT relies on prompt-tuning. Note that both algorithms collect a set of sentences S^w for each word w to capture the contextual representation of w. We use E_i(w, s; θ) to denote the embedding of word w in sentence s in the ith layer of the model parameterized by θ. The layer-prototype of word w in the ith layer of a model, e_i(w), is then defined as the average of E_i(w, s; θ) over all s ∈ S^w, and the prototype of a word is defined as the average of all layer-prototypes. Table 2.3 provides the formal definition of the loss for both algorithms.

Algorithm   L_bias                                                   L_representation
DPCE        Σ_{t∈T} Σ_{s∈S^t} Σ_{a∈A} (e_i(a)^T E_i(t, s; θ_e))²      Σ_{i,j∈{1,...,d}, i<j} JS(P_{a^(i)} || P_{a^(j)})
ADEPT       Σ_{i,j∈{1,...,d}, i<j} JS(P_{a^(i)} || P_{a^(j)})          KL(M_θ(S) || M_θ′(S))

Table 2.3: Summary of bias mitigation algorithms used for language models.
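As an illustration of the prototypes just defined, the sketch below computes e_i(w) for every layer and the overall prototype with a Hugging Face-style encoder; the word-to-token alignment is deliberately simplified (it assumes the word corresponds to a single token occurring in each sentence), and the helper name is ours rather than part of DPCE or ADEPT.

```python
import torch

def word_prototypes(model, tokenizer, word, sentences, device="cpu"):
    """Average E_i(w, s; theta) over s in S^w for each layer i, then over layers."""
    word_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    layer_sums, n = None, 0
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model(**enc, output_hidden_states=True)
        positions = (enc["input_ids"][0] == word_id).nonzero()
        if len(positions) == 0:          # word not present as a single token; skip
            continue
        pos = positions[0].item()
        layers = torch.stack([h[0, pos] for h in out.hidden_states])  # (L + 1, d)
        layer_sums = layers if layer_sums is None else layer_sums + layers
        n += 1
    e_i = layer_sums / n                 # layer-prototypes e_i(w)
    return e_i, e_i.mean(dim=0)          # and the overall prototype of w
```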
Bias Benchmarks: We utilized three different bias benchmarks to evaluate the efficacy of bias mitigation.

SEAT: The Sentence Encoder Association Test [SEAT; 121] is an extended version of the Word Embedding Association Test [WEAT; 24], which places the WEAT words into pre-determined sentences and computes the effect size and p-value. Effect sizes closer to zero indicate a lower magnitude of bias.

CrowS-Pairs: Crowdsourced Stereotype Pairs [CrowS-Pairs; 133] evaluates whether a model gives a higher probability to stereotypical sentences than to anti-stereotypical sentences. The CrowS-Pairs test set consists of pairs of sentences that target explicit expressions of stereotypes by replacing the stereotyped word with an anti-stereotype word. The ideal model should achieve a score of 50.

StereoSet: StereoSet [132] provides three scores: the Stereotype Score (SS), the Language Modeling Score (LMS), and the Idealized CAT (ICAT) score. For each item in StereoSet, the model chooses among stereotypical, anti-stereotypical, and unrelated variations. The SS is the percentage of items in which the model prefers the stereotypical variation. The LMS is the percentage of items in which the model prefers either the stereotypical or the anti-stereotypical sentence over the unrelated one. The ideal model should obtain an SS of 50% and an LMS of 100%. The ICAT combines the SS and LMS, and the ideal model should obtain 100% on ICAT.

Proposed Method: We propose that SCM’s warmth and competence can be used to define a social-group-agnostic bias subspace in language models. We operationalize the warmth and competence dimensions with 16 pairs of “high” and “low” words for each dimension from Nicolas et al. [135]. We hypothesize that SCM-based debiasing in contextualized language models will: 1. reduce bias for multiple social groups comparably to group-specific bias mitigation and generalize to understudied social groups or attributes (§2.3.1), and 2. maintain the expressiveness of language models (§2.3.2). To test our hypotheses, we mitigate bias with DPCE and ADEPT relying on SCM and compare it with social-group-specific debiasing on gender and religion on multiple bias benchmarks (§2.3). In addition to bias benchmarks, we compare the models on the GLUE benchmark [188] to evaluate whether the debiased models have the same expressiveness as the original model.

                 Race                    Gender                  Religion
                 S3      S4      S5      S6      S7      S8      SR1     SR2
BERTLARGE        0.42*   0.41*   0.89*   0.37    0.42*   -0.26   0.01    -0.16
DPCE  Gender     0.39*   0.61*   0.74*   0.72*   -0.20   -0.26   -0.18   -0.20
      Religion   0.30*   0.76*   0.79*   -0.43   0.69*   0.67*   0.01    -0.12
      SCM        0.25*   0.41*   0.32*   0.72*   -0.42   -0.08   -0.24   -0.15
ADEPT Gender     0.35*   0.25    0.82*   0.70*   -0.34   0.15    0.26    -0.21
      Religion   0.64*   0.54*   0.88*   0.66*   0.44*   0.51*   0.61*   0.17
      SCM        0.37*   0.21    0.63*   0.81*   0.82*   0.50    0.50*   -0.10

Table 2.4: SEAT effect sizes for selected tests on Race (S3, S4, S5), Gender (S6, S7, S8), and Religion (SR1, SR2). *: p < 0.01.

Datasets and Experiment Details: For all experiments, we use the same neutral target words as previous debiasing methods [87]. To mitigate gender bias, we use binary gender words from Zhao et al. [201], and to mitigate bias on religion, we use ternary religion words from Liang et al. [110]. Words for SCM-based debiasing are included in Appendix B.1. We use two corpora to collect sentences for each word: 1) News-Commentary v15 (NC-v15, https://www.statmt.org/wmt20/translation-task.html) and 2) BookCorpus (BC, https://huggingface.co/datasets/bookcorpus). We collect the sentences for gender words from NC-v15, and for religion and SCM words we use NC-v15 combined with BC. We extract 58,252 neutral sentences and 14,688 attribute sentences for each gender subgroup, 4,949,126 neutral sentences and 6,485 attribute sentences for each religion subgroup, and 4,650,778 neutral sentences and 35,064 attribute sentences for each SCM subgroup. All models are repeatedly debiased using a random sample of sentences.

2.3.1 Results - Bias Reduction

We compare variations of each algorithm (DPCE or ADEPT) separately to disentangle the effect of SCM-based debiasing from the debiasing algorithm (i.e., fine-tuning or prompt-tuning). We evaluate biases of our models using three benchmarks (§2.3) spanning four social groups/attributes: race, gender, religion, and profession.
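Before turning to the results, the sketch below illustrates how StereoSet-style numbers like those in the following tables can be derived from per-item model preferences; the item format and the exact ICAT combination (LMS scaled by min(SS, 100 − SS) / 50, which is our reading of the StereoSet paper) are assumptions, not code from this work.

```python
def stereoset_scores(items):
    """items: dicts with model scores for the 'stereo', 'anti', and 'unrelated' options."""
    pro = sum(it["stereo"] > it["anti"] for it in items)
    related = sum(max(it["stereo"], it["anti"]) > it["unrelated"] for it in items)
    ss = 100.0 * pro / len(items)              # Stereotype Score, ideal: 50
    lms = 100.0 * related / len(items)         # Language Modeling Score, ideal: 100
    icat = lms * min(ss, 100.0 - ss) / 50.0    # Idealized CAT score, ideal: 100
    return ss, lms, icat
```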
DPCE + SCM: On StereoSet, our results show that SCM-based bias mitigation achieves a higher ICAT score than group-specific debiasing for all categories (gender, profession, race, and religion) as well as overall (Table 2.5). Table 2.4 shows that our results on SEAT reflect the same pattern: with the exception of S6, in all cases SCM-based DPCE results in smaller (or insignificant) effect sizes. Finally, as shown in Table 2.5, for CrowS-Pairs, SCM-based bias mitigation with DPCE achieves a better score on two of the three categories. Altogether, these results demonstrate that, for DPCE, SCM-based debiasing can mitigate social biases for multiple social groups/attributes on par with or better than group-specific debiasing, even when evaluated on the explicitly targeted group.

Benchmark    Task        Metric      BERTLARGE   DPCE                          ADEPT
                                                 Gender   Religion   SCM       Gender   Religion   SCM
StereoSet    Gender      LMS ↑       86.5        84.7     75.5       83.3      86.0     85.2       85.4
                         SS → 50     63.2        59.9     57.9       58.7      58.3     59.7       59.5
                         ICAT ↑      63.6        68       63.5       68.8      71.7     68.7       69.2
             Profession  LMS ↑       84.8        82.2     76.7       82.6      85.1     85.1       85.3
                         SS → 50     59.4        57.3     55.7       55.5      56.7     56.3       56.2
                         ICAT ↑      68.8        70.3     67.9       73.5      73.6     74.4       74.7
             Race        LMS ↑       83.6        82.9     81         83.5      84.2     83.6       83.6
                         SS → 50     57.1        56.3     55.8       55.9      53.0     55.0       54.1
                         ICAT ↑      71.7        72.6     71.7       73.7      79.3     75.3       76.6
             Religion    LMS ↑       84.3        80.1     79.9       83.5      82.7     81.9       81.9
                         SS → 50     59.9        57.8     58.8       57.9      55.9     58.1       56.3
                         ICAT ↑      67.5        67.7     65.8       70.3      72.8     68.6       71.5
             Overall     LMS ↑       84.4        82.8     78.7       83.1      84.7     84.3       84.4
                         SS → 50     58.8        57.1     56.1       56.1      55.1     56.2       55.6
                         ICAT ↑      69.5        71       69         72.9      76.0     73.9       74.8
CrowS-Pairs  Gender      Score → 50  55.7        53.4     55.7       52.7      50.4     55.73      54.2
             Race        Score → 50  60.4        59       58.4       68.6      61.6     61.94      61.2
             Religion    Score → 50  67.6        69.5     65.7       60.2      65.7     63.81      64.8

Table 2.5: StereoSet and CrowS-Pairs benchmarks on BERTLARGE and debiased models. We debiased the original model with DPCE and ADEPT using three different sets of attributes: Gender, Religion, and SCM. The best score for each algorithm, based on the significance level over ten runs, is in bold. → 50 means the ideal model score is 50.

ADEPT + SCM: SCM-based bias mitigation with ADEPT performs better than mitigating biases on religion in almost all cases on StereoSet, except for SS on gender. When compared to mitigating gender bias using ADEPT, SCM achieves comparable LMS, SS, and ICAT scores overall. Surprisingly, we observe that SCM-based debiasing, compared to other models, achieves better scores on race on SEAT. However, it is the gender-debiased model that achieves the best results on race for StereoSet. This could in part be caused by the fact that race items in StereoSet contain multiple instances of gender-related stereotypes (e.g., “woman often notice how <blank> the egyptian men was” with “sexiest”, “generous”, and “lobster”). On CrowS-Pairs, SCM achieves scores on race and religion that are better than or comparable to the best models, but mitigating biases on gender seems to be more effective for gender. As for SEAT, on average ADEPT with SCM obtains lower effect sizes for race and comparable effect sizes for religion, but ADEPT with gender results in lower effect sizes on gender. Across all benchmarks, our SCM-based approach consistently achieves a better score than at least one of the group-specific models.
This suggests that our SCM-based approach is a better solution for minimizing the potential harms of language models for all social groups/attributes, and their potential intersections.

2.3.2 Results - Natural Language Understanding

                SST-2   MRPC        STS-B       RTE    WNLI
BERTLARGE       91.2    90.7/86.8   90.2/90.0   73.3   56.3
DPCE  Gender    92.4    91.3/87.7   90.3/89.8   61.4   47.9
      Religion  93.6    90.2/86.0   90.4/90.0   68.2   50.7
      SCM       93.2    89.4/84.8   90.8/90.5   72.6   47.9
ADEPT Gender    93      90.5/86.5   89.9/89.6   72.6   56.3
      Religion  93.2    91.0/87.5   89.9/89.6   73.3   56.3
      SCM       93.7    89.6/85.5   89.8/89.7   72.9   56.3

Table 2.6: GLUE benchmark for language models.

Following Kaneko and Bollegala [87], we used five tasks from the GLUE benchmark: the Stanford Sentiment Treebank [SST-2; 167], the Microsoft Research Paraphrase Corpus [MRPC; 48], the Semantic Textual Similarity Benchmark [STS-B; 30], Recognising Textual Entailment [RTE; 38, 70, 61, 62], and the Winograd Schema Challenge [WNLI; 105]. Table 2.6 shows that the debiased models preserve their expressiveness: all models achieve comparable or better scores than the original model on all five tasks. We speculate that this is due to the L_representation loss component in both algorithms. This analysis suggests that, for language models, SCM-based debiasing not only does not deteriorate the models but also improves their performance on some benchmarks.

2.4 Discussion

In this chapter, I demonstrated the viability of a theory-driven approach to debiasing pre-trained word embeddings and language models. By replacing the ad hoc, social-group-specific component of existing debiasing techniques with a general, theory-driven, social-group-agnostic counterpart, I have solved two concrete problems with prior debiasing work and opened the door for more research into theory-driven approaches. First, SCM-based debiasing was shown to sufficiently reduce bias across social attributes without relying on any manually constructed group-specific resources. Second, it is evident from our results that SCM-based debiasing is scalable with respect to generic social attributes, and embeddings or LMs debiased with respect to SCM can be thought of as generally debiased. Importantly, SCM-based debiasing results in improvements in the quality of the respective word embeddings or language models.

The word embeddings and language models used in this chapter are trained on contemporary English, and they accordingly contain the explicit stereotypes of our social contexts as encoded in English. Stereotypes for a specific group can be quite different depending on the language and culture. Although out of the scope of the present chapter, cross-societal differences in human stereotyping have been shown to be explainable using the SCM framework [37]. Thus, it is fair to posit that our SCM-based framework generalizes to social group biases beyond those in English. Future research is encouraged to replicate our study in non-English languages. Furthermore, we would like to point out that there exists a catalogue of bias measurements for word embeddings and language models in the field. However, the current catalogue is far from comprehensive in covering social groups even in the contemporary English/American context, with few resources for the intersectionality of groups and attributes. Additionally, some of these measures have been shown to fail robustness checks.
Although this chapter uses some of the most recently developed measures, such as ECT and EQT, we believe that few, if any, of these measurements are completely sound or complete. In our experiments for language models, we tried to measure bias for the same social group or attribute using multiple benchmarks but still found some substantial differences in results across benchmarks. While developing a new bias measurement scale is not within the scope of this chapter, we are optimistic that the social psychological theory in which our approach is grounded provides the bedrock for the current evidence of SCM efficacy to hold on future benchmarks.

Unlike bias mitigation methods for static word embeddings, such as partial projection, the post hoc methods of debiasing for large language models cannot be trivially applied to mitigate biases for multiple social attributes simultaneously. For DPCE, the formulation allows for mitigating biases on multiple social attributes, but collecting enough sentences for each attribute that do not include any words from other attributes or neutral words (i.e., mutually exclusive sentences) was not possible with the corpora we experimented with. This problem is exacerbated as the number of social attributes grows, due to the mutual exclusivity condition. For ADEPT, on the other hand, the formulation did not trivially handle multiple dimensions. Hence, we employed a coordinate-descent modification in our experiments to apply ADEPT to SCM (more details in §A.2.4). We encourage future work to devise data-efficient methods that can mitigate biases on multiple dimensions at the same time.

Chapter 3: Embracing the Dynamic Nature of Problematic Content

Social media enables free expression, but it also allows for the dissemination of problematic content, such as hate speech and harassment, due to partial anonymity, leading to discrimination and exclusion. Researchers and social media companies recognize this issue and have crafted a range of resources and strategies to combat it, including the development of automated systems designed to identify problematic content [148]. Nevertheless, a major limitation of the current approaches and strategies is that they adopt a static perspective, with a predominant focus on problematic content in English [197]. These static approaches tend to fixate on a narrow definition of problematic content within a specific platform during a particular period. While the status quo approach has proven effective in curbing the spread of problematic content in the U.S., it is important to acknowledge the inherent limitations associated with the static view. Our social contexts continuously evolve and adapt to new situations. As a result, people’s expressions of hate, toxicity, and incivility, among other types of biases and prejudices, undergo adaptations in response to such changing circumstances [172]. Even in stable social situations, differences in countries, contexts, languages, and perspectives shape the boundaries of what is considered problematic content [98]. For example, Germany criminalizes Holocaust denial due to its incitement to hatred, whereas in the United States and the majority of other European nations, such legislation does not exist. This static approach has also led to heterogeneity among both the methods and the resources.
For instance, since such approaches often demand large annotated training data, researchers are frequently ill-prepared to promptly respond to emerging cases, like the upsurge in new forms of xenophobia during the COVID pandemic. This heterogeneity has also led to inconsistencies in reports regarding progress in the automatic detection of problematic content [197]. These inconsistencies often complicate the comparison of competing methods and hinder the identification of effective strategies for advancement. Moreover, these challenges become even more pronounced due to resource limitations in non-English languages. This is primarily because, within this static framework, transferring knowledge from one language to another poses a non-trivial challenge. As a result, even tech companies, such as Meta, have language blind spots in their problematic content detection [34]. The overarching goal of this proposal is to establish a continual learning framework that integrates various resources across different languages while capturing the dynamic nature of the issue of problematic content detection. To lay the foundation for the development of this framework, it is crucial to initially establish a benchmark that consolidates existing resources and accurately reflects the dynamic aspect of problematic content.

Continual Learning (CL) refers to a class of machine learning methods that involve the continuous training of models on a stream of tasks, enabling them to adapt and assimilate new knowledge while preserving previously learned tasks [155, 144]. I propose a novel perspective grounded in the CL framework to address the dynamic nature of problematic content detection. This approach would allow models to accumulate knowledge from a wide array of existing resources, as well as newly curated ones contributed by the community. This acquired knowledge, in turn, equips the model for rapid adaptation to the ever-changing landscape of online content. Below, I first provide an overview of my efforts to establish a continual learning benchmark for detecting problematic content in English. I then present my experiments, including implemented methods and results, demonstrating the superiority of my proposed continual learning framework compared to existing static perspectives. Finally, I outline my plan to investigate the influence of data quality on this framework and discuss its expansion to multilingual settings.

Figure 3.1: Current static approaches (I) train and evaluate models on a fixed set of datasets. Our benchmark embraces the dynamic aspects of problematic content detection in two stages: upstream training (II) and evaluation (III), where data is assumed to arrive as a stream, and downstream few-shot evaluation (IV), which measures models’ generalization to novel forms of problematic content.

3.1 Background

3.1.1 Problematic Content Detection

Social media platforms offer individuals means to freely express themselves. However, certain features of social media, such as partial anonymity, which may promote freedom of expression, can also result in the dissemination of problematic content. Researchers and social media companies recognize this issue and have developed various strategies to tackle it, including automated systems to identify problematic content.
Consequently, multiple definitions of problematic content have been proposed [148], encompassing specific areas like misogyny detection [e.g., 52] and hate speech [e.g., 92], as well as broader categories such as offensive language detection [e.g., 41]. Ideally, such systems should possess the capability to identify undesirable content irrespective of factors such as timing, specific linguistic form, or the social media platform used. However, recent studies have revealed limited generalizability of such systems, particularly in the context of hate speech detection [197, 152]. Yin and Zubiaga [197] recognized that the scarcity of hate speech in sources poses a challenge to constructing datasets and models. They also acknowledged the difficulty in modeling implicit notions of problematic content. Combining diverse datasets can alleviate both issues by reducing the scarcity of problematic content and enhancing a model’s ability to identify implicit notions through exposure to a broader range of data.

3.1.2 Multitask Learning for Problematic Content

In recent years, multitask learning [27] has gained considerable attention as a promising approach for problematic content detection [90, 147, 1, 89, 174]. Multitask learning leverages the inherent relationships and shared characteristics among related tasks (e.g., hate speech, racism, and sexism detection in the context of problematic content) to improve performance over a model that learns the tasks individually. By jointly training on multiple related tasks, the models can benefit from knowledge transfer and information sharing across different domains. Furthermore, empirical evidence shows the advantage of multitask learning in enhancing generalization and robustness. This advantage could potentially be due to the model’s ability to learn common patterns and effectively differentiate between various forms of harmful language across different tasks [119, 202, 89].

Although multitask learning has demonstrated potential in the field of problematic content detection, it is not exempt from limitations. A significant drawback is the expense involved in retraining the model whenever a new task is introduced to the existing set. As the number of tasks grows, so do the complexity and computational resources needed for retraining. This becomes particularly challenging in the context of a dynamic landscape of problematic content, where new types of hate speech or toxic behavior emerge constantly. Multitask learning encounters various other challenges apart from computational complexity. These challenges include task interference, a phenomenon wherein the acquisition of multiple tasks concurrently can exert a detrimental impact on each other’s learning processes, and catastrophic forgetting, which entails the loss of previously acquired knowledge when learning new tasks [155, 97, 192].

3.1.3 Continual Learning and Few Shot Generalization

Continual learning is an approach that has emerged to address challenges like task interference, computational complexity, and catastrophic forgetting faced by multitask learning; instead of learning all tasks simultaneously, continual learning models learn new tasks over time while maintaining knowledge of previous tasks [155]. This incremental approach allows for efficient adaptation to new tasks while preserving the knowledge acquired from previous tasks [144].
By leveraging techniques such as parameter isolation, rehearsal, or regularization, continual learning mitigates catastrophic forgetting and ensures that the model retains its proficiency on previously learned tasks [97, 44, 189, 164]. Moreover, the capability to incrementally update the model alleviates the computational burden associated with retraining the entire multitask model every time new tasks are added. As a result, continual learning presents a promising approach to tackle the scalability and adaptability issues inherent in multitask learning. This framework becomes particularly attractive for tasks like hate speech detection, toxicity detection, and similar endeavors within a rapidly changing environment of problematic content. The only work in this space is Qian et al. [151], which applies continual learning to detect hate speech on Twitter. However, their focus is limited to a single definition of hate speech, and they analyze a single snapshot of Twitter data. Consequently, their approach does not fully account for the dynamic nature of problematic content across the internet.

3.2 Continual Learning Benchmark for Problematic Content Detection

3.2.1 Problem Formulation

Our objective is to develop models that are not only agile in detecting new manifestations of problematic content but are also capable of accumulating knowledge from diverse instances across different time periods and platforms. Such models should possess the ability to rapidly learn and identify new manifestations of problematic content on novel platforms, even when only limited data is available. As time progresses, we anticipate a natural increase in the availability of resources for problematic content detection. Therefore, to encourage building models that leverage this increase in resources, we consider the existing resources as a continuous stream of incoming data. In this context, we assume that there exists a problematic content detection model f, which undergoes continual learning on a stream of problematic content detection binary classification tasks (T^u = [T^u_1, ..., T^u_{N_u}]) over time. We refer to this set of tasks as upstream tasks. In addition to accumulating knowledge from the stream of tasks, this continual learning model should be able to rapidly generalize its knowledge to numerous related unseen tasks [84]. We formulate this ability as few-shot learning over a separate set of binary classification tasks T^d = [T^d_1, ..., T^d_{N_d}], referred to as downstream tasks.

Figure 3.2: Sequence of upstream tasks in the experiment with chronological task order. Note that datasets are ordered according to the earliest publication date of the data, and tasks (i.e., labels) within each dataset are ordered randomly.

3.2.2 Training and Evaluation

During the continual learning stage, the model encounters a sequentially ordered list of N_u upstream tasks, [T^u_1, ..., T^u_{N_u}], where each task has its own distinct training and test sets. To evaluate the few-shot learning capability of the sequentially trained model f, we then adapt it to a collection of N_d few-shot tasks, individually represented as T^d_i. In this scenario, each unseen task is associated with only a small number of training examples. For evaluation purposes, a task is considered “new” if the model has not been exposed to labels from that task.
This applies to the ith upstream task (T^u_i) before the model’s upstream training reaches T^u_i, as well as to all downstream tasks (Figure 3.1).

The paucity of problematic content online results in most datasets used in this work being quite unbalanced. In the evaluation of models trained on such unbalanced datasets, the Area Under the Curve (AUC) often takes precedence over the F1 score [21]. AUC serves as a measure of a model’s ability to differentiate between positive and negative classes, calculated by assessing the area under the Receiver Operating Characteristic (ROC) curve. Hence, we chose AUC as our primary evaluation metric for both the upstream training and downstream adaptation processes. We acknowledge that the selection of an evaluation metric is not without its controversies. The rationale behind this choice primarily stems from the extensive adoption of the AUC in the problematic content detection literature. In the context of this work, it is important to note that our conclusions would have remained consistent even if we had opted for the F1 score as our primary metric (§B.1.4). To enable fair comparisons, we used a fixed set of held-out test data for all models. Below we outline the specific measures we employ to characterize the desired attributes of each model.

Few-Shot Performance: To assess the model’s few-shot generalization ability, we evaluate the continually trained model f on unseen tasks by individually fine-tuning it for each task T^d_i using a few annotated examples. The few-shot AUC for task T^d_i is denoted AUC^FS_i, and we report the average few-shot AUC across all downstream tasks.

Final Performance: To assess the accumulation of knowledge on upstream tasks, we evaluate the AUC of f at the end of continual learning over the upstream tasks. This evaluation allows us to determine the extent to which model f forgets the knowledge pertaining to a specific task once it acquires the ability to solve additional tasks. We report the average AUC over all upstream tasks.

Instant Performance: To assess the extent of positive transfer among upstream tasks, we evaluate the AUC of f on task T^u_i right after the model is trained on T^u_i. We report the average instant performance across all upstream tasks.
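These three measures can be summarized with a small helper like the sketch below, assuming `auc[i][j]` stores the AUC on upstream task j evaluated right after training on upstream task i (per-task AUCs themselves can be obtained with sklearn.metrics.roc_auc_score); the variable and function names are illustrative, not the benchmark’s actual code.

```python
import numpy as np

def instant_auc(auc):
    """Average AUC on each upstream task immediately after training on it."""
    return float(np.mean([auc[i][i] for i in range(len(auc))]))

def final_auc(auc):
    """Average AUC over all upstream tasks at the end of the task stream."""
    last = len(auc) - 1
    return float(np.mean([auc[last][j] for j in range(len(auc))]))

def fewshot_auc(downstream_aucs):
    """Average AUC over downstream tasks after k-shot adaptation."""
    return float(np.mean(downstream_aucs))
```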
3.2.3 Datasets

We selected datasets for our benchmark based on the following criteria: 1) they must be related to problematic content detection, 2) they must be in English, and 3) they must include a classification task (or a task transformable into classification). We aimed to use datasets that span different sources and time periods and rely on different definitions of problematic content. Even though we currently focus on one language, the dynamic nature of our formulation easily allows for expansion of this benchmark to other languages (see §3.5 for more details). Our benchmark currently covers data from eight different sources, namely Twitter, Reddit, Wikipedia, Gab, Stormfront, news site comments, chat dialogues, and synthetically generated text. These datasets cover a wide range of definitions of problematic content, from focused definitions such as sexism and misogyny to broader definitions such as toxicity. All datasets in our work are publicly available for research purposes. We do not redistribute these datasets but offer instructions in our repository for downloading and recreating the benchmark from publicly available sources. In addition, we provide license information for all datasets, along with descriptive statistics, in §B.1.2. For all datasets, we use the original train/test/dev splits when available; otherwise we split the data 80/10/10 randomly. We briefly discuss each dataset below; [U] and [D] denote upstream and downstream datasets respectively.

Call Me Sexist, But [CMSB; 109] [D] Consists of 6,325 tweets from two sources: 1) Twitter data that was previously annotated for sexism and racism [190], and 2) Twitter data collected between 2008 and 2019 using the phrase “call me sexist, but.” Each tweet is labeled for sexist content and sexist phrasing, with both being single-choice options.

US-election [66] [D] Consists of 3,000 tweets covering hate speech and offensive language, collected during the six weeks prior to the 2020 presidential election until one week after the election. Each tweet was annotated for being hateful or not, without considering whether the target is a group or an individual.

Misogyny Detection [misogyny; 67] [D] Contains 6,567 Reddit posts from 34 subreddits identified as misogynistic, collected from February to May 2020 and annotated with a three-level hierarchical taxonomy. We only use the top-level annotations, which are binary labels for misogynistic content.

Contextual Abuse Dataset [CAD; 185] [U] Consists of 25k Reddit posts collected from 16 subreddits more likely to contain a diverse range of abusive language, with a focus on taking the context of the conversations into account. A hierarchical annotation schema is proposed that takes the context of the conversation into account; Level 1: abusive, non-abusive; Level 2: for abusive, (i) identity-directed, (ii) affiliation-directed, and (iii) person-directed. In our benchmark, we use the three labels from the second level to stress test models’ ability to learn variations of abuse.

Ex-Machina: Personal Attacks at Scale [Personal attack; 193] [U] Includes 100k annotated comments from a public dump of Wikipedia from 2004-2015. Annotators were asked to label comments that contain personal attacks or harassment, in addition to some finer labels about the category of attack or harassment. We included detecting personal attacks, quoted personal attacks (QA), and personal attacks targeted at a third party (TPA) as separate tasks in our benchmark.

Unhealthy Comment Corpus [UCC; 150] [U] Consists of 44,355 comments collected from the Globe and Mail news site. Every comment is annotated according to a two-level hierarchy; Level 1: healthy or unhealthy; Level 2: binary labels indicating the presence or absence of the following unhealthy sub-attributes: (i) hostility, (ii) antagonism, (iii) insults, (iv) provocation, (v) trolling, (vi) dismissiveness, (vii) condescension, (viii) sarcasm, and (ix) generalization.

The Gab Hate Corpus [GHC; 92] [U] Contains 27,665 posts from Gab.com, spanning January to October 2018, annotated based on a typology for hate speech derived from definitions across legal precedent. Posts were annotated for Calls for Violence (CV), Human Degradation (HD), Vulgarity and/or Offensive language (VO), and explicit or implicit language.

Stormfront [42] [D] Includes 10,568 sentences collected from 22 sub-forums of Stormfront.org spanning from 2002 to 2017.
Each sentence has been classified as containing hate or not, depending on whether it meets the following three premises: “a) deliberate attack, b) directed towards a specific group of people, and c) motivated by aspects of the group’s identity.”

Dialogue Safety [126, 195] [D] The Dialogue Safety dataset includes five datasets in the domain of dialogue safety. Three datasets, namely ParlAI single standard, ParlAI single adversarial, and ParlAI multi, are sourced from ParlAI [126]. The other two datasets, BAD2 and BAD4, are from Bot-Adversarial Dialogue [195]. The ParlAI datasets consist of 30,000 samples, while the BAD datasets consist of 5,784 samples. Conversations in the BAD dataset can span up to 14 turns, and following [195], we consider the last two and four utterances of the conversation (BAD2 and BAD4) in our benchmark. All dialogue safety datasets provide toxic or safe labels.

Dygen [186] [hate U, rest D] Consists of 41,255 samples dynamically generated in a human-and-model-in-the-loop setting to train more robust hate detection models. The authors collected four rounds of data using Dynabench [94] and annotated each sample hierarchically; Level 1: binary hate/non-hate label; Level 2: subclasses of hate (i.e., derogation, animosity, threatening language, support for hateful entities, and dehumanization) and 29 target identities (e.g., immigrant, Muslim, woman, etc.). We use Level 1 for upstream training and Level 2 for downstream adaptation.

Hatecheck [159] [D] Contains 3,728 synthetically generated sentences motivated by 29 hate speech detection model functionalities; 18 of these functionalities test for hateful content and cover distinct expressions of hate, and the other 11 functionalities test for non-hateful content and cover contrastive non-hate.

Multitarget-CONAN [CONAN; 51] [D] Consists of 5,003 samples of hate speech and counter-narrative pairs targeting different groups (LGBTQ+, Migrants, Muslims, etc.), created using a human-in-the-loop methodology in which a generative language model generates new samples that, after confirmation by expert annotators, are added to the dataset. In our benchmark we included detection of hate speech toward each target group as a separate task.

Civil-comments [47] [U] Includes two million comments from the Civil Comments platform annotated by human raters for various toxic conversational attributes. Each comment has a toxicity label and several additional toxicity subtype attributes: severe toxicity, obscene, threat, insult, identity attack, and sexually explicit.

Twitter Abusive [Abusive; 54] [U] Contains 80k tweets from March to April 2017 annotated for multiple fine-grained aspects of abuse, namely offensiveness, abusiveness, hateful speech, aggression, cyberbullying, and spam.

Large-Scale Hate Speech Detection with Cross-Domain Transfer [hate; 177] [U] Includes 100k tweets from 2020 and 2021, each annotated by five annotators for hate speech. Tweets are labeled as hate if “they target, incite violence against, threaten, or call for physical damage for an individual or a group of people because of some identifying trait or characteristic.”

3.3 Models and Methods

3.3.1 Models

We represent all tasks in a consistent binary classification format and conduct our experiments using a pretrained language model, specifically BART-Base [106].
In addition to fine-tuning all the model weights of BART-Base, we also explore two other variations. 1) Adapter: We experiment with Adapters [75]. In addition to the classification head, adapter training only trains the parameters of the Adapters, which are two-layer multilayer perceptrons inserted after each layer of BART. We used a hidden size of 256 for all Adapter layers. 2) BiHNet: The hypernetwork h accepts a task representation z as input and generates the model parameters of a separate prediction model, denoted f, in order to address the specific task at hand [84].

3.3.2 Upstream Training

Single Task Learning: We finetune a pretrained model on each of the tasks separately. Note that this model completely ignores the sequential order imposed on our upstream tasks and serves as a baseline for evaluating the performance of the base model on each task without any knowledge transfer.

Sequential Finetuning (Vanilla): We also finetune a pretrained model on the sequence of upstream tasks [T^u_1, ..., T^u_{N_u}] without any continual learning algorithms. Previous research suggests that this model will suffer from catastrophic forgetting [155]. Comparing the final performance of this model with a continual learning algorithm gives us a measure of the ability of these algorithms in knowledge accumulation.

                    Final             Instant           Fewshot
Model               AUC     ∆ AUC     AUC     ∆ AUC     AUC     ∆ AUC
Adapter-Single      -       -         0.879   -         0.806   -
BiHNet-Single       -       -         0.870   -         0.786   -
Adapter-Vanilla     0.518   -         0.882   -         0.765   -
BiHNet-Vanilla      0.617   -         0.878   -         0.772   -
BiHNet-Reg          0.792   +0.174    0.882   +0.003    0.819   +0.047
BiHNet-EWC          0.676   +0.059    0.881   +0.003    0.766   -0.006
Adapter-Multitask   0.873   +0.355    -       -         0.816   +0.052
BiHNet-Multitask    0.834   +0.216    -       -         0.796   +0.024

Table 3.1: AUC scores for the chronological experiment. ∆ values are calculated in comparison to the corresponding Vanilla model.

Multitask Learning (MTL): To assess the upper bound of knowledge accumulation on the set of upstream tasks, we finetune a pretrained model with multitask learning on all upstream tasks, implemented via hard parameter sharing. For Adapter-Multitask models we shared only the adapter parameters, and for BiHNet-Multitask models we used a shared BiHNet for all tasks.

Continual Learning: Finally, we finetune a model continually on the sequence of upstream tasks [T^u_1, ..., T^u_{N_u}]. This model should ideally be able to 1) use knowledge from previous tasks to learn a new upstream task, and 2) retain knowledge of the seen upstream tasks. We experiment with two regularization-based continual learning algorithms.

Bi-level Hypernetworks for Adapters with Regularization [BiHNet-Reg; 84]: This model is specifically designed to enhance the generation of adapter weights by optimizing bi-level (long and short) task representations. Its primary objective is to address two important challenges: mitigating catastrophic forgetting and enhancing the overall generalizability of the model. Towards the first challenge, regularization is imposed on the generated adapters. To improve generalization, this model learns two representations for each task: one for high-resource settings and one for few-shot cases. We calculated the long task representation by averaging the embeddings of all text samples in the training split of a dataset. Short task representations were computed by averaging the embeddings of 64 texts sampled from the training set.
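As a rough sketch of the bi-level task representations just described (the exact implementation follows [84]), the helper below assumes an `encode` function that maps a text to a fixed-size vector, for example mean-pooled encoder states; the function and parameter names are ours.

```python
import random
import numpy as np

def task_representations(train_texts, encode, n_short=64, seed=0):
    """Long representation: mean embedding of all training texts.
    Short representation: mean embedding of a 64-text sample (few-shot view)."""
    long_rep = np.mean([encode(t) for t in train_texts], axis=0)
    sample = random.Random(seed).sample(train_texts, min(n_short, len(train_texts)))
    short_rep = np.mean([encode(t) for t in sample], axis=0)
    return long_rep, short_rep
```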
Elastic Weight Consolidation [EWC; 97]: EWC leverages the principles of Bayesian inference, suggesting a method that selectively slows down learning on the weights important for previous tasks. The model retains old knowledge by assigning a larger penalty to changes in crucial parameters, effectively making them “elastic”.
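For illustration, the EWC penalty can be sketched as below; this is a simplified version, not the exact implementation used in our experiments, and it assumes `fisher` (per-parameter Fisher information estimates from previous tasks) and `old_params` (the parameters learned on them) are given.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on a new task: loss = task_loss + ewc_penalty(model, old_params, fisher)
```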
3.3.3 Downstream Adaptation

An ideal model for problematic content detection should be able to learn its new manifestations quickly. Therefore, we evaluate our models’ ability to learn unseen datasets of problematic content using only a few examples. We report performance using k = 16 shots. A sensitivity analysis on the number of shots is provided in §3.4.3.

3.4 Experiments

Most of the datasets in our benchmark include annotations for various aspects of problematic content (e.g., UCC includes labels for antagonism, insults, etc.). To ensure flexibility, we treated each label as a separate task. This choice is rooted in the likely possibility that we will need to introduce additional labels to the existing set in the future. To accommodate potential future updates to the label taxonomy, it is preferable to have models that can quickly adapt and learn new labels. In order to minimize the exchange of information between the upstream and downstream tasks, across all our datasets with the exception of Dygen, we categorized all tasks within a dataset as either upstream or downstream. Our selection of larger datasets for the upstream tasks was driven by both the data requirements of upstream training and the fact that larger datasets typically encompass a broader range of problematic content. This decision enables the model to accumulate knowledge on general notions of problematic content, which aligns with our objectives. Subsequently, we assigned as downstream those tasks that 1) had limited labeled data, and 2) had minimal overlap (e.g., same domain or labels) with the upstream tasks. To assess the efficacy of our proposed framework in practical scenarios, we ran our main experiments by ordering the upstream tasks chronologically. Specifically, we used the earliest publication date of each dataset as the temporal reference point to order the upstream datasets. Note that each dataset consists of multiple labels (i.e., tasks). Since we do not have any information about the temporal order of tasks within datasets, we chose this order at random. This experiment allowed us to capture the evolution of the research landscape on problematic content detection, thereby providing a more nuanced understanding of the progress of model performance over time. Figure 3.2 shows the order of upstream tasks in this experiment. We experiment with alternative orders of upstream tasks in §3.4.2.

3.4.1 Efficacy of the Continual Learning Framework

Baselines: To determine the learning capabilities of each model, we finetune a classifier from each architecture on each task. The average few-shot, final, and instant performance of Adapter-Vanilla and BiHNet-Vanilla is presented in the first rows of Table 3.1. We see the largest gap in performance for these models on the final performance metric. This can be attributed to BiHNet’s meta-learning capabilities.
Figure 3.3: Shuffled sequence of tasks for the chronological experiment (task labels across Civil comments [Dixon et al., 2018], UCC [Price et al., 2020], Hate [Toraman et al., 2022], Dygen [Vidgen et al., 2021b], Personal attack [Wulczyn et al., 2017], CAD [Vidgen et al., 2021a], GHC [Kennedy et al., 2022], and Abusive [Founta et al., 2018]).

Chronological Upstream Datasets with Shuffled Tasks: In our chronological experiment, we initially assigned tasks within each dataset in a random order, as we lacked information regarding their precedence. To gauge the potential influence of the selected task sequence on our results, we train all model variations again using an alternative random task order, reshuffling tasks while keeping the dataset order intact. The sequence of upstream tasks in this experiment is illustrated in Figure 3.3. Our results reflect a similar pattern as the initial experiment (Table 3.2). Specifically, the few-shot AUC of BiHNet-Reg improves by nearly 2% compared to BiHNet-Vanilla, falling only 1.2% short of BiHNet-Adapter-Multitask. In terms of the final AUC, once again, BiHNet-Reg outperforms all sequential fine-tuning variations, and the instant AUC of all models falls within a close range. Overall, this experiment suggests that our proposed approach is robust to task perturbations within datasets. In other words, while the order of tasks within a dataset affects the resulting model’s performance, the ordering of performance among different algorithms remains consistent.

Figure 3.4: Random sequence of upstream tasks.

                      Upstream            Downstream
Method                Final     Instant   Few-shot   ∆ Final   ∆ Instant   ∆ Few-shot
Chronological
Adapter-Vanilla       0.7648    0.8844    0.7568     -         -           -
BiHNet-Vanilla        0.7594    0.8815    0.7865     -0.0054   -0.0031     +0.0297
BiHNet-Reg            0.7963    0.8830    0.8043     +0.0315   -0.0014     +0.0475
BiHNet-EWC            0.7513    0.8783    0.7702     -0.0135   -0.0061     +0.0134
Random Order
Adapter-Vanilla       0.6784    0.8859    0.8321     -         -           -
BiHNet-Vanilla        0.7115    0.8838    0.8146     +0.0331   -0.0021     -0.0175
BiHNet-Reg            0.7859    0.8846    0.8087     +0.1075   -0.0013     -0.0234
BiHNet-EWC            0.6571    0.8863    0.8190     -0.0213   +0.0004     -0.0131
Adapter-Multitask     0.8752    -         0.8531     -         -           -
BiHNet-Multitask      0.8321    -         0.8215     -         -           -

Table 3.2: Results in AUC for experiments with alternative upstream task order. Rows marked with “Chronological” show the results of experiments with chronologically ordered datasets but shuffled task orders within a dataset. Rows marked with “Random Order” show the results on a completely random order of upstream tasks. The ∆ values are computed in comparison to Adapter-Vanilla in each experiment. Notably, BiHNet-Reg demonstrates very stable performance regardless of the upstream task order.

Random Upstream Task Order: To show the efficacy of our proposed continual learning approach in adapting to any scenario, we randomly ordered the upstream tasks. Figure 3.4 shows the upstream task sequence used in our experiments. Note that we kept the dataset splits (i.e., train/dev/test) consistent with the chronological experiment.
This approach ensures that our comparison remains fair and valid, allowing for a meaningful assessment of the model’s performance under the altered evaluation conditions. Overall, we observe similar performance patterns among the different algorithms, but the differences in performance are now less pronounced (Table 3.2). Below we discuss the results in detail.

Baselines: Interestingly, in this experiment, the Adapter-Vanilla baseline performs exceptionally well on downstream tasks despite achieving lower final performance. This could be attributed to the order of tasks, specifically the tasks at the end of the upstream sequence. While this result might be favorable, Adapter-Vanilla is not well-suited for practical settings where the sequence of upstream tasks constantly evolves. This is evident from the high variation in the final and few-shot performance of the model across experiments.

Multitask Upperbound: The final and few-shot evaluation results for multitask models are displayed in the last two rows of Table 3.2. It is important to note that these models, having been exposed to all tasks simultaneously, do not have an instant performance metric defined for them.

Does the collection of problematic content tasks help with learning new upstream tasks? To address this inquiry, we can assess the immediate performance of a continual learning (CL) model when applied to [T^u_1, T^u_2, ..., T^u_i] and compare it to a pretrained model fine-tuned exclusively on T^u_i. Our results (∆ Instant) show evidence of slight positive transfer; however, the magnitude of this transfer is negligible.

Does continual learning improve knowledge retention? The final AUC values, as shown in the first column of Table 3.2, indicate the models’ ability to retain knowledge from a sequence of tasks at the end of training. Our results suggest that continual learning (BiHNet-Reg) outperforms naive training (BiHNet-Vanilla) by at least 0.07 in AUC, indicating its potential to mitigate catastrophic forgetting. However, BiHNet-Reg falls 0.04 short of the multitask counterpart. Further investigation is needed to understand this difference.

Does upstream learning help with generalization to new manifestations of problematic content? Comparing the single-task baselines with continual and multitask learning, our results demonstrate a noteworthy improvement in models’ generalization ability due to upstream training.

3.4.3 The Impact of Number of Shots in Downstream Adaptation

We performed a sensitivity analysis on the number of shots to examine how it affects our models. Specifically, we conducted few-shot training using 8, 16, and 32 shots. The corresponding results are shown in Figure 3.5. Our results show a consistent pattern: all models improve as the number of shots increases, and the ordering between models stays the same, with one exception: BiHNet-Reg outperforms BiHNet-Vanilla with more shots. We leave further investigation of this effect for future work.

Figure 3.5: Few-shot performance (AUC) based on number of shots (K).

3.4.4 Qualitative Analysis

We provide qualitative examples of texts correctly classified by BiHNet-Reg and misclassified by Adapter-Vanilla below.

Examples from the CMSB dataset with sexism present:

• This is the exact reason why Women shouldn’t be involved. Not sexist. But situations like this will always be blown out of proportion.
But situations like this will always be blown out of proportion. • I’m not sexist, but women are inferior. proving that you can still be an idiot regardless of your “high IQ” Examples from CMSB dataset labeled as not sexist. • I’m not sensitive... But if in this modern era, a good adult is judged as one that pays the bills? A good adult is also one that can cook! • I do not like dumb refs for football.... • Advice for adults: Think like an adult ”act” like a pro • I almost hate every song by any Southern country artist As demonstrated in the first two examples, BiHNet-Reg is able to correctly classify instances with a direct mention of “not sexist” but the vanilla model fails to do so. In the later examples, the vanilla model misclassifies texts that mention any gender stereotypes despite the fact that the mentions are not used in the context of gender. 3.5 Discussion In conclusion, we propose a continual learning benchmark and framework for detecting problematic content, that realizes its dynamic and adaptable nature. We define essential characteristics of an ideal model and create a continual learning benchmark and evaluation metrics to capture the variability in problematic content. Our benchmark has two key components: First, an upstream sequence of problematic tasks over which we measure a model’s ability in accumulating knowledge, and second, a separate set of downstream few-shot tasks on which we gauge a model’s agility in learning new manifestations of problematic content. Our experiments clearly demonstrate the effectiveness of this formulation, particularly in its ability to adapt to new types of problematic content. To keep the benchmark up-to-date, we have designed it with continuous updates in mind; tasks can be effortlessly added, removed, or repositioned. We encourage the community to actively contribute to and expand this benchmark, as it serves as a collaborative platform for advancements in the field. 47 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation We emphasize that this is only one experimental scenario for dividing the tasks into upstream and downstream. Our benchmark’s modular design allows for easy experimentation with other scenarios allowing researchers to further study various continual learning setups and evaluate a variety of continual learning algorithms. The social science examination of the evolution of problematic content carries its own importance and follows a dedicated line of inquiry. Due to space constraints, we have not provided an exhaustive discussion of this subject. We recommend referring to [98, 9] for a comprehensive overview of this area. We acknowledge that the experiments in our paper are limited to the continual learning methods employed. We encourage future researchers to explore other continual learning approaches. The benchmark under discussion is currently designed only for English language content, neglecting the challenges posed by problematic content in other languages and cultures. Our design, however, allows for an easy expansion of the benchmark to include other languages. We have outlined the procedure to expand the benchmark on the accompanying repository and encourage the community to contribute to the benchmark. Though it presents a new measure of progress and baseline results, further investigations and extensive experimentation are needed to fully evaluate the potential of continual learning in detecting evolving problematic content. 
The study’s approach, predominantly using majority label datasets, potentially leads to bias and overgeneralization in detecting problematic content, given the inherent subjectivity of such content influenced by cultural norms, individual sensitivities, and societal changes over time. The effectiveness of this benchmark could significantly vary due to the diversity of sources and annotation schemas, potentially leading to cultural bias and an overreliance on AI for content detection, thereby neglecting the importance of nuanced human moderation. Future work can explore the potential considering this subjectivity under our continual learning framework. Moreover, the benchmark opens possibilities for misuse, including training models to generate problematic content or designing adversarial attacks, where malicious actors can exploit the understanding of detection systems to craft content that evades detection. Datasets used in this benchmark may have a high prevalence of problematic content targeting certain social groups. Hence, models trained on these datasets could produce unfair outcomes, such as higher false positive rates for the aforementioned groups [47, 191]. Recently, various methods have been proposed to mitigate these biases, such as those by Mostafazadeh Davani et al. [129], Kennedy et al. [93], Omrani et al. [138]. Future research could examine the extent of biases’ influence on the model within our framework and the effectiveness of the mentioned techniques in mitigating them. Moreover, some datasets may hold personally identifiable information or data from which individual details can be inferred. Since we are not redistributing any of the datasts, to address this concern, we suggest applying Google’s DLP, a tool designed to scan and classify sensitive data, to the datasets. Another concern in research on problematic content detection is the potential misuse for censorship. However, we emphasize that, in contrast to private methods concealed behind corporate doors, an open-access or academic approach to detecting problematic content fosters transparency. This allows the public to understand and critique the detection criteria. Such transparency ensures accountability, given that academic methods frequently undergo peer review and public scrutiny, thereby addressing biases and mistakes. 48 Chapter 4: Towards Multilingual Culturally Sensitive Models Building upon the success of my continual learning framework and benchmark in capturing the dynamic nature of problematic content in English, this chapter addresses the next pivotal phase in realizing the overarching goal of this thesis: the development of inclusive, adaptable, and robust models for detecting a wider range of problematic content. In order to meet the crucial requirement of inclusivity, I extend the framework introduced in chapter 3 to multilingual settings. Through this expansion, my aim is to encompass the diverse nuances of problematic content across different languages, thereby promoting a more comprehensive and equitable approach to its identification. This transition to a multilingual setting not only widens the framework’s scope but also demands careful consideration of the unique challenges and opportunities that accompany this shift. Importantly, recognizing the significance of culture in shaping language use and interpretation, this chapter emphasizes the integration of cultural factors into the multilingual framework. 
Culture plays a pivotal role in influencing the perception of what constitutes problematic content, as societal norms, values, and taboos vary significantly across different cultural contexts. By incorporating cultural considerations into the training pipeline, the models can better account for context-specific nuances and sensitivities, ultimately enhancing their effectiveness in identifying and addressing problematic content in diverse cultural settings. By embracing cultural inclusivity, we not only enhance the accuracy and relevance of the detection models but also contribute to fostering more inclusive and respectful online environments across linguistic and cultural boundaries. 4.1 Building a Multilingual Benchmark Developing cross-lingual methodologies for detecting problematic content necessitates a multilingual benchmark that enables the testing of these approaches. In order to achieve this, I expanded the previous benchmark by incorporating datasets from 15 different languages into my analysis. Each dataset had to satisfy two criteria to be included in this expansion: 1) it must contain labels related to offensive language, and 2) it must be in a non-English language. Here, I will review all the datasets used in this chapter. It is essential to 49 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation emphasize that all mentioned datasets are publicly available and have been specifically curated to facilitate research on hate speech and offensive language detection, aligning with my usecase in this thesis. Albanian Nurce et al. [137] introduce a dataset consisting of 11,874 posts collected from Instagram and YouTube. Four annotators have annotated the posts using hierarchical annotation proposed in [198]. In this annotation three subtasks are defined as distinguishing between: 1) offensive and non-offensive, 2) targeted or untargeted offense, 3) individual, group, or other targets. In this study we use data from subtask 1. Danish Sigurbergsson and Derczynski [166] introduce a dataset consisting of 800 Facebook posts and 2,800 Reddit posts and their respective comments. Annotation is done based on subtask of [198] and one binary label indicating offensiveness is provided. English de Gibert et al. [43] introduce a dataset consisting of 10,568 sentences sourced from 22 sub-forums of Stormfront.org, covering the period from 2002 to 2017. Each sentence is categorized based on whether it fulfills three criteria: a) deliberate attack, b) directed towards a specific group of people, and c) motivated by aspects of the group’s identity. Estonian [165] introduce a dataset consisting of 31.5M comments on news articles from Eesti Ekspress and labels to determine why deleted comments were considered inappropriate. The eight defined labels are as follows: 1) Disallowed content, 2) Threats, 3) Hate Speech, 4) Obscenity, 5) Deception and trolling, 6) Vulgarity, 7) Language, and 8) abuse. We take a comment as offensive if any of the aforementioned categories are present. German Assenmacher et al. [7] introduce a dataset consisting of 85,000 comments from the German newspaper Rheinische Post and the moderator’s binary decision of abusiveness. The data is further annotated using the following fine-grained categories: 1) sexism, 2) racism, 3) threats, 4) insults, 5) profane, 6) meta/organizational, and 7) advertisement. In this work we aggregate the first five labels and create a new label for offensiveness. Greek Pitenis et al. 
[146] introduce the Offensive Greek Tweet Dataset (OGTD) containing 4,779 tweets collected between May and June 2019. The guidelines and schema for subtask A of [198] are used, and each tweet is labeled as offensive or not offensive. Italian Bosco et al. [19] introduce a dataset consisting of 17,567 comments on 99 Facebook posts and 6,928 tweets. The task defined on these two datasets is binary classification for detecting hate speech. Latvian Pollak et al. [149] provide EMBEDDIA, a set of tools, datasets, and challenges for European languages. One of their datasets is 12M comments on Latvian news from the Ekspress media group collected from 2015 to 2019. The labels indicate whether or not the comment was deleted from the website. Similar to Estonian, comments are often in Russian as well. Portuguese Leite et al. [103] introduce a dataset consisting of 21K tweets collected from July to August 2019. The data is annotated for hate speech detection. Six fine-grained labels are also provided to indicate the type of hate speech. These labels include 1) LGBTQ+ phobia, 2) Insult, 3) Xenophobia, 4) Misogyny, 5) Obscene, and 6) Racism. In this work we aggregate all labels and create a new label for offensiveness. Russian Gorbunova [65] introduces a dataset consisting of 3,000 comments from the Russian social network VKontakte, collected to evaluate existing classifiers on distorted words. Two binary labels are assigned to each comment to indicate toxicity and distortion. Turkish Çöltekin [35] introduces a dataset consisting of 40,000 tweets collected from March 2018 to September 2019, with a gap of two weeks during November 2018. The tweets are then labeled using subtask A of the hierarchical labeling introduced in [198]. Surzhyk Andrusyak et al. [5] introduce a dataset consisting of 2,000 YouTube comments in Surzhyk, which is spoken in Russia and Ukraine. A binary label is then assigned to each comment to indicate whether the comment is abusive or not. Chinese Deng et al. [45] introduce a dataset consisting of 37,480 posts from the Zhihu and Sina Weibo social media platforms. The data is annotated using a binary label to indicate offensiveness and a categorical label named topic that takes values of race, gender, and region. The topic label indicates which topic the offender targeted. Hindi Bhardwaj et al. [14] provide 8,200 posts collected from Twitter, Facebook, and WhatsApp. The posts are then categorized into five categories: 1) fake, 2) hate, 3) offense, 4) defame, and 5) non-hostile. Arabic The dataset provided by Mulki et al. [130] consists of 6,000 tweets collected from March 2018 to February 2019. Each tweet has been assigned to one of three categories: 1) Normal, 2) Hate, and 3) Abusive. We treat the tweets in the normal category as non-offensive and assign an offensive label to tweets in the hate and abusive categories.

4.2 Multilingual Continual Learning Baselines
It is crucial to emphasize that the continual learning approach discussed in the previous chapter does not involve any language-specific processing, which means it has the potential to be extended to other languages. However, effectively leveraging patterns across languages requires the development of tailored methods designed explicitly for this purpose.
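Since the corpora described in Section 4.1 ship with very different annotation schemes, each one is reduced to a single binary offensiveness label before being added to the benchmark. The sketch below restates a few of those per-dataset rules in code; the field names are hypothetical placeholders for each corpus's raw columns, so this is an illustration of the aggregation logic rather than the exact preprocessing script.

# Sketch: collapsing heterogeneous annotation schemes into one binary
# "offensive" label, mirroring the per-dataset decisions described in Section 4.1.
# All field names below are hypothetical placeholders.

ESTONIAN_CATEGORIES = ["disallowed", "threats", "hate_speech", "obscenity",
                       "deception", "vulgarity", "language", "abuse"]
GERMAN_CATEGORIES = ["sexism", "racism", "threats", "insults", "profane"]

def binarize(example, dataset):
    if dataset == "estonian":
        # offensive if any of the eight moderation categories applies
        return int(any(example[c] for c in ESTONIAN_CATEGORIES))
    if dataset == "german":
        # aggregate the first five fine-grained categories
        return int(any(example[c] for c in GERMAN_CATEGORIES))
    if dataset == "arabic":
        # "hate" and "abusive" tweets count as offensive; "normal" does not
        return int(example["class"] in {"hate", "abusive"})
    # corpora that already provide a single binary label (e.g., Danish, Greek)
    return int(example["label"])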
In this section, I establish the extent to which the approach proposed in the previous chapter can readily generalize to multilingual settings.

Language | Dataset | Text Source | Task/Label | Positive #/Total #
Albanian | [137] | Instagram & YouTube | Offensive Language/subtask a | 1,568/11,874 (13.20%)
Danish | [166] | Facebook & Reddit | Offensive Language/label | 384/2,960 (12.97%)
English | [43] | Stormfront | Offensive Language and Hate Speech | 1,119/9,916 (11.29%)
Estonian | [165] | News Comments | Deleted Comment/infringed on rule | 126,386/1.5M (8.02%)
German | [7] | News Comments | Offensive Language/aggregated labels | 23,044/85,000 (27.11%)
Greek | [146] | Twitter | Offensive Language/subtask a | 2,486/8,743 (28.43%)
Italian | [19] | Facebook & Twitter | Hate Speech/hate | 2,764/6,000 (46.06%)
Latvian | [149] | News Comments | Deleted Comment/is enabled | 485,679/3,379,490 (14.37%)
Portuguese | [103] | Twitter | Hate Speech/aggregated labels | 9,255/21,000 (44.07%)
Russian | [65] | VKontakte Social Network | Toxicity/toxicity | 456/2,400 (19.00%)
Turkish | [35] | Twitter | Offensive Language/subtask a | 6,046/31,277 (19.33%)
Surzhyk | [5] | YouTube | Abusive Language/abusive | 654/2,000 (32.70%)
Chinese | [45] | Zhihu & Sina Weibo | Offensive Language/label | 12,723/25,726 (49.45%)
Hindi | [14] | Twitter & Facebook & WhatsApp | Hate Speech/Labels Set | 2,678/5,728 (46.75%)
Arabic | [130] | Twitter | Hate Speech/Class | 1,791/4,676 (38.30%)

Table 4.1: Source, task, statistics, and reference of the datasets in the multilingual benchmark.

4.2.1 Experimental Setup
To enable the approaches of chapter 3 to learn and predict offensive language across different languages, we swapped the pretrained language model from RoBERTa-base to XLM-R [36]. XLM-R extends the RoBERTa architecture to handle multiple languages effectively. It incorporates cross-lingual pretraining objectives, enabling it to learn representations that are robust across different languages. By leveraging large-scale multilingual corpora, XLM-R captures language-agnostic features, facilitating the transfer of knowledge across diverse linguistic contexts. We ordered the tasks randomly as follows: Arabic → Estonian → Hindi → German → Albanian → Russian → English → Greek → Danish → Turkish → Latvian → Portuguese → Italian → Ukrainian → Chinese. We particularly note a stark difference between the final and instant AUC scores (Table 4.2), with instant scores notably lower than final scores. This disparity contrasts with the patterns observed in English (see Table 3.1). Such an observation suggests that the transfer between multilingual tasks can often yield negative outcomes, necessitating a more in-depth analysis of the relationship between detecting offensive language in different languages and cultures.

Method | Final | Instant
BiHNet-Reg | 0.6916 | 0.5880
BiHNet-EWC | 0.4588 | 0.5539

Table 4.2: Instant and final AUC for the multilingual benchmark with random order.

4.3 Leveraging Cultural Diversity in Offensive Language Detection
In recent years, the escalating prevalence of offensive language on prominent social media platforms such as Facebook and Twitter has emerged as a significant and pressing concern. The landscape of online discourse has been further complicated with the introduction of content generated by language models [183, 9].
Within the NLP community, extensive research efforts have been dedicated to developing resources and methodologies for detecting offensive content [See 197, for a review]. Initial endeavors were predominantly concentrated on monolingual settings, with the majority of the research focusing on the English language [184]. However, recently, the trajectory of research has shifted towards addressing the challenge of offensive language detection in other languages or in multilingual settings [4]. This shift, however, is hindered by the constrained availability of labeled data and the considerable variability in what constitutes offensive language across diverse cultures and languages [157]. In numerous NLP tasks, Cross-Lingual Transfer Learning (CLTL) has emerged as a promising avenue for addressing challenges related to data scarcity. CLTL leverages domain knowledge from high-resource languages to benefit low-resource languages. However, the application of many CLTL methods to offensive language detection has proven less successful [136]. The intricate linguistic structures and cultural variations across languages pose significant challenges for CLTL [83]. Davani et al. [40] emphasize the pivotal role of cultural and psychological factors in determining what is deemed offensive. Despite this recognition, a considerable portion of recent studies overlook the significance of cultural context and advocate a one-sizefits-all solution, using English data to enhance the performance of offensive language classifiers in low-resource languages [156]. Consequently, as demonstrated in recent findings by Lee et al. [102], hate speech classifiers are culturally insensitive. In this study, I systematically investigate the influence of linguistic and cultural similarities on the crosslingual transferability of hate speech and offensive language detection. Contrary to previous suggestions, I observe that training on English corpora before delving into offensive language detection in a different target language leads to diminished performance in certain cases (section 4.6). Furthermore, I find that including culturally diverse datasets in the first stage of CLTL significantly improves the performance of target languages in low-resource settings (section 4.7). Based on my findings, I advocate for CLTL methods that leverage cultural diversity. My results suggest that the model’s exposure to culturally diverse datasets not only broadens the model’s cultural repertoire but also increases its ability to precisely identify offensive content across different languages. My detailed analysis of cross-lingual transfer learning across 15 languages, and 105 language pairs, aims to disentangle the respective roles of linguistic and cultural similarities between datasets on cross-lingual transferability among them. This investigation underscores the necessity of moving beyond English-centric approaches and integrating cultural context into NLP solutions for offensive language detection. 53 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Figure 4.1: The relationship between cultural differences, as measured by the WEIRDness distance between La and Lt, and ∆(La, Lt). The regression line, derived from Equation 4.1, indicates that the WEIREDness distance predicts CLTL performance gains (β = 0.3, p = 0.02). 
4.4 Background 4.4.1 Cross lingual Transfer Learning The primary objective in CLTL for offensive language detection is to leverage knowledge from a language with existing resources (i.e., the auxiliary language) to enhance the effectiveness of offensive language detection in a language with limited resources (i.e., the target language). Various methods have been proposed for CLTL of offensive language detection. These approaches can be broadly categorized as instance transfer, feature transfer, and parameter transfer [83]. Instance transfer involves approaches that transfer either the labels (e.g., via label projection) or the text (e.g., via translation) to the new language. Translation approaches, however, may be prone to errors, possibly neglecting cultural nuances and resulting in translations inconsistent with the original language [39]. Feature transfer methods focus on using latent representations of texts (e.g., multilingual embeddings) to transfer knowledge from the source to the target language. However, Nozza [136] demonstrated that multilingual embeddings exhibit poor generalization across languages when lacking training data in the target language. Finally, parameter transfer approaches use the parameters of a model trained on an auxiliary language to enhance performance on the target language. An essential element in parameter transfer approaches is the choice of target and auxiliary languages. Since cultural factors can influence language use, connotations, and perceptions of offensiveness, it becomes crucial to systematically investigate their impact on CLTL approaches. 54 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation 4.4.2 Culture, Language, and Offensiveness Culture broadly encompasses a range of “good-enough” solutions that each society has developed to address survival problems [142], often operationalized as causally distributed patterns of mental representations across a population [10]. Cultural solutions manifest in a diverse array of beliefs, values, norms, and practices [20]. One of the dimensions of cultural differences is individualism vs. collectivism [179]. Individualistic cultures emphasize values of autonomy, distinction, and the pursuit of uniqueness. In contrast, collectivistic cultures prioritize unity, conformity, communal harmony, and mutual responsibility [143, 120]. A critical domain where individualistic and collectivistic cultures diverge is in perceptions of offensiveness, including the nature of offenses, the intensity of emotional reactions they provoke, and views on suitable retribution [117]. Collectivistic cultures perceive offenses against communal entities such as national symbols, religious beliefs, or family honor as grave threats to social unity [96]. Conversely, in individualistic cultures, offenses against an individual’s achievements, professional reputation [68], or personal identity, like gender or sexual orientation, are taken with equal gravity. The individualism vs. collectivism difference, while providing valuable insights into the cultural psychology of offense, fails to account for other dimensions of cultural differences such as a society’s tolerance for norm violations, known as the tightness–looseness dimension [59], which influences how people perceive and react to offensive language. In recent years, cultural psychologists have introduced a new comprehensive index for quantifying cultural differences, known as the WEIRDness score [131]. 
“WEIRD”, in this context, stands for “Western, Educated, Industrialized, Rich, and Democratic” [73]. This index is a composite score derived from several measures of cultural differences, including Hofstede's [74] cultural dimensions (which encompass, among others, individualism-collectivism scores), the tightness–looseness dimension, Schwartz's values [163], and a range of other psychological and behavioral measures. The WEIRDness score is a quantitative measure designed to assess the cultural distance of a country to the U.S., which is considered a quintessential WEIRD nation (for a more in-depth discussion, refer to Section C.2). Countries that align closely with the characteristics of the U.S. are deemed more WEIRD, while those diverging from U.S. traits are labeled as more non-WEIRD. Past cross-cultural evidence documents how WEIRDness can reliably predict a multitude of psychological differences across nations, such as differences in moral values and in the perception and interpretation of hate content among different populations [73, 8]. Previous studies have indicated that individuals from WEIRD countries tend to classify fewer items as offensive, particularly when China is excluded from the analysis [40].

Linguistic similarity is another crucial factor in understanding the cross-lingual transferability of offensive language detection due to its potential impact on the effectiveness of multilingual models. Languages vary not only in vocabulary but also in syntax, semantics, and phonetics, and various approaches have been proposed to quantify the similarity between languages [176, 116, 63]. To measure linguistic similarity, we adopt a data-driven approach for language comparison, emphasizing the identification of cognates through computational analysis of phonetic data, especially consonants [49]. This method applies phonological rules to systematically identify potential cognates. An advanced scoring system evaluates the similarity between languages at multiple levels, from phonetics to broader structures. Finally, statistical analysis of cognate scores ensures the validity and reliability of the language-relatedness findings, distinguishing true linguistic connections from coincidental similarities. For a comparative analysis of the available indices and the rationale behind our choice of linguistic similarity measure, see Section C.3.

4.5 Experimental Setup
Our goal is to investigate how linguistic and cultural differences affect the cross-lingual transferability of offensive language detection. Let Mθ denote a pretrained multilingual language model M parameterized by θ, and let La and Lt denote the auxiliary and target languages, respectively. Let fLt and fLa denote the offensive language detection models that are initialized with Mθ and trained only on data from the target and auxiliary languages, respectively. Furthermore, let fLa→Lt denote the cross-lingual transfer model that has two training stages: in the first stage, Mθ is trained on the auxiliary language to obtain fLa; in the second stage, fLa is fine-tuned on data from the target language. The overall goal in CLTL is to maximize the performance gain resulting from the first stage of training, formally defined as

∆(La → Lt) = AUC(fLa→Lt) − AUC(fLt),

where AUC(·) denotes the area under the receiver operating characteristic curve of a model on the test set from Lt.
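A minimal sketch of this two-stage procedure and of ∆(La → Lt) is given below. The train and evaluate_auc callables stand in for standard fine-tuning and test-set scoring routines and are assumptions, not the actual experiment code.

# Sketch: cross-lingual transfer gain as defined above. `train(model, split)`
# fine-tunes a copy of the model on one language's training split and
# `evaluate_auc(model, split)` reports AUC on a test split; both are assumed helpers.
import copy

def transfer_gain(pretrained, aux_lang, tgt_lang, train, evaluate_auc):
    # Stage 1: fine-tune the multilingual model M_theta on the auxiliary language.
    f_aux = train(copy.deepcopy(pretrained), aux_lang.train_split)
    # Stage 2: continue fine-tuning on the target language.
    f_aux_to_tgt = train(copy.deepcopy(f_aux), tgt_lang.train_split)
    # Baseline: the same pretrained model fine-tuned on the target language only.
    f_tgt = train(copy.deepcopy(pretrained), tgt_lang.train_split)
    # Delta(L_a -> L_t): gain attributable to the auxiliary-language stage.
    return (evaluate_auc(f_aux_to_tgt, tgt_lang.test_split)
            - evaluate_auc(f_tgt, tgt_lang.test_split))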
We use ∆(La, Lt) to denote the average of ∆(La → Lt) and ∆(Lt → La). In Section 4.6, we assess whether English (or any auxiliary language) universally guarantees positive transfer (∆(La → Lt) > 0). Subsequently, in Section 4.7, to quantify how cultural and linguistic differences between La and Lt influence ∆(La, Lt), we rely on the following regression:

∆(La, Lt) = β0 + β1 × ∆WEIRDness(La, Lt) + β2 × ∆Language(La, Lt) + ϵ    (4.1)

where ∆WEIRDness(La, Lt) denotes the difference in the WEIRDness scores of La and Lt [131], and ∆Language(La, Lt) denotes their linguistic distance [49]. We conduct our experiments on 15 languages, namely Albanian, Danish, English, Estonian, German, Greek, Italian, Latvian, Portuguese, Russian, Turkish, Surzhyk, Chinese, Hindi, and Arabic. More information on the datasets used in this work can be found in Section 4.1. We split each dataset into an 80/10/10 train, dev, and test split. To control for differences in dataset size across languages, we take a stratified sample of a fixed number of instances (n = 1000) from each language for the training set. Given that the language model needs to handle data from multiple languages, we used XLM-R [36] and trained all model parameters for 10 epochs.

Figure 4.2: Distribution of ∆(La → Lt) by auxiliary language La. Most languages exhibit both positive and negative impacts on CLTL, underscoring the significance of considering cultural factors when choosing La.

4.6 English Data Does Not Always Help
Recognizing the pivotal role of cultural factors in shaping perceptions of offensiveness, we reevaluate the one-size-fits-all approach proposed by previous researchers [156] on a diverse set of languages and cultural backgrounds. Specifically, we test the assumption that employing English as the auxiliary language consistently enhances the performance of the target language (Lt). Contrary to this assumption, our empirical investigation reveals that using English as the auxiliary language results in performance degradation (∆(English → Lt) < 0) in 40% of the cases. Specifically, we observe diminished performance for Russian, Portuguese, Hindi, Estonian, Latvian, and Italian (Section C.4). As shown in Figure 4.2, most languages exhibit diverse effects, encompassing both positive and negative impacts on CLTL. This analysis underscores the limitations of English-centric approaches, highlighting the potential of considering cultural factors in offensive language detection.

4.7 Cultural Diversity Improves Models
To quantify the impact of cultural and linguistic factors on CLTL gains, we conducted a linear regression analysis predicting ∆(La, Lt) based on language difference and WEIRDness difference (see Equation 4.1). We find evidence that the WEIRDness difference significantly predicts CLTL performance gains (β = 0.3, p = 0.02) even after controlling for linguistic factors. Specifically, for each unit increase in the WEIRDness difference, there is an expected increase of 0.3 units in ∆(La, Lt). However, language similarity was not a significant predictor (p = 0.21) of ∆(La, Lt). In the model, the assumptions of linearity, independence, and normality were met, with a residual standard error of 0.05.
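For completeness, the regression in Equation 4.1 can be fit with ordinary least squares; the sketch below assumes one row per language pair, with hypothetical column names for ∆(La, Lt), the WEIRDness distance, and the linguistic distance.

# Sketch: fitting Equation 4.1 over the 105 language pairs with OLS.
# The CSV file and its columns (delta_auc, weirdness_dist, language_dist) are
# assumed names, not artifacts shipped with this thesis.
import pandas as pd
import statsmodels.formula.api as smf

pairs = pd.read_csv("pairwise_transfer_gains.csv")  # one row per language pair
model = smf.ols("delta_auc ~ weirdness_dist + language_dist", data=pairs).fit()
print(model.summary())  # coefficients (beta_1, beta_2), p-values, residual std. error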
Our results imply that training models across languages from diverse cultural contexts could serve as a potential solution to building culturally sensitive models capable of capturing a more accurate reflection of cultural nuances. 4.8 Discussion This chapter underscores the crucial role of cultural diversity in cross-lingual approaches to offensive language detection. We conducted a systematic examination of the influence of both cultural and linguistic factors on cross-lingual transferability across 15 languages. Interestingly, we find that linguistic proximity does not impact transferability. However, transfer significantly improves when using culturally diverse language pairs. This emphasizes the importance of cultural context in offensive language detection and exposes the shortcomings of relying on English-centric approaches. The findings of chapter are constrained by the specific languages and datasets chosen for our analysis. The language models utilized in this chapter introduce limitations. Different language models may yield distinct results due to variations in architecture, training data, and underlying algorithms. Consequently, the findings should be interpreted within the context of the chosen language models. The chapter is based on data available up February 2024. Changes in language usage, cultural trends, or advancements in language models beyond this date are not considered. Even though this chapter highlights the significance of incorporating cultural diversity in CLTL for offensive language detection, I do not endorse an approach that disregards universal ethical standards. Recognizing that certain expressions of hate, such as calls for genocide, are universally unacceptable based on the Declaration of Human Rights, my findings advocate for a balanced perspective that respects cultural nuances while upholding global ethics. Acknowledging these limitations is crucial for a nuanced interpretation of this chapter’s findings and encourages future research to address these constraints for a more comprehensive understanding of the broader linguistic landscape. 58 Chapter 5: Understanding the Negative Effects of Noisy Labels In recent years, natural language processing has been revolutionized by large pre-trained language models such as Llama [178], GPT-4 [2], and Gemini [60]. However, the massive parameter size of these models, often in the hundreds of millions or billions, presents challenges for fine-tuning and deployment. Parameter Efficient fine-tuning (PEFT) Methods like Low-Rank Adaptation [LoRA; 78] have emerged as an efficient approach to adapt only a small subset of a large model’s parameters for a downstream task [58, 71]. While computationally appealing, it remains unclear whether these parameter-efficient methods exhibit the same characteristics and capabilities as full fine-tuning, especially in terms of robustness to label noise. Machine learning datasets often contain label noise, which occurs when assigned labels to a data point differ from the ground truth. In fact, real-world datasets have been estimated to contain anywhere from 8.0% to 38.5% of noisy labels [169, 101]. Recent research has highlighted the remarkable robustness of fine-tuned language models to label noise. For example, T¨anzer et al. [175] find that pre-trained models such as BERT are more robust to noise. However, this generalization capacity comes at the cost of lower F1 scores in the face of extreme class imbalances when no noise is present. Zhu et al. 
[203] demonstrate that existing noise handling methods do not improve the peak performance of BERT models. Importantly, prior investigations primarily focus on assessing the impact of label noise on fully fine-tuned models within balanced datasets. In this chapter, my primary focus is on assessing whether LoRA tuning maintains robustness to noise inherent in the original model through fine-tuning. Additionally, I delve into the practical implications of both LoRA and fine-tuning methodologies by exploring scenarios involving imbalanced training data. Through comprehensive experimentation across datasets with varying noise levels and imbalances, my results demonstrate that LoRA tuning effectively preserves robustness against random label noise, matching the robustness observed in models subjected to full fine-tuning. This underscores LoRA’s parameter efficiency comes without compromising model robustness. Notably, unlike full fine-tuning, which tends to overfit noisy samples along with clean ones, LoRA’s training performance stabilizes at lower values as noise intensity increases. I meticulously monitor the influence of noisy and clean samples during training, revealing that LoRA predominantly learns from clean samples. Furthermore, my analysis of learning and forgetting events highlights LoRA’s superior ability to retain learned information amidst increasing noise levels compared to 59 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Figure 5.1: Comparison of learning dynamics for LoRA (left) and fine-tuning (right) on a balanced subset of the IMDB dataset. Both LoRA and fine-tuning exhibit robustness to noise, achieving high validation performances. However, LoRA demonstrates a distinctive resistance to overfitting the noise. full fine-tuning. We also scrutinize the model’s resilience under substantial label imbalance and observe a marked decline in validation performance as data imbalance worsens, with this decline initiating at lower noise levels, particularly when the imbalance is more pronounced. Overall, this chapter paves the way for understanding LoRA’s potential in real-world scenarios with noise and imbalance. Our results demonstrate that LoRA tuning emerges as a robust and efficient contender for fine-tuning even in the presence of noisy labels. It retains the impressive noise resistance of its full-finetuning counterparts while showcasing unique advantages. Notably, LoRA learns primarily from clean data, exhibiting lower forgetting rates than fine-tuning under noise. 5.1 Background 5.1.1 Sources of Label Noise Label noise is common in tasks involving human experts due to various factors ranging from insufficient evidence to perceptual errors [122]. Fr´enay and Verleysen [57] categorize potential sources for label noise into four categories. Firstly, the information provided to annotators may lack sufficient detail, leading to unreliable labeling. For example, the annotation manual may not be elaborate or prescriptive enough [158]. Secondly, errors may also stem from non-experts often hired through crowdsourcing platforms to reduce annotation costs. Thirdly, many tasks, such as offensive language detection, are inherently subjective, where a single ground truth does not exist, leading to considerable variation in labels assigned by individual annotators. 
Lastly, label noise may occur due to data encoding issues (e.g., a post might be flagged as offensive because of accidental clicks).

5.1.2 Robustness to Noisy Labels
Deep learning approaches are known to suffer significant performance degradation when faced with noisy labels. This is because these approaches have the capacity to overfit an entire noisy training dataset, regardless of the level of noise present [199, 200]. As a result, various methods have been proposed to mitigate the negative impact of noisy labels. These approaches can be broadly grouped into four categories: robust architectures, robust regularization, robust loss design, and sample selection [170]. Limited research in NLP has investigated the susceptibility of models to the negative impacts of noisy labels. For instance, Jindal et al. [85] show that CNN models used in text classification tend to overfit noisy labels, leading to a decrease in generalization performance. They demonstrated that adding a noise adaptation layer can significantly reduce the adverse effects of noisy labels. On the contrary, Transformers have exhibited remarkable resilience to noisy labels [175, 203]. However, much of this research focuses on common benchmark NLP datasets with balanced label distributions, raising questions about whether this robustness persists in more practical settings with heavy label imbalance.

5.1.3 Parameter Efficient Tuning Methods
Methods for PEFT have become an important area of research in addressing the challenges stemming from the massive parameter size of large language models [58]. PEFT methods keep the model parameters frozen and primarily operate by updating only a limited set of additional parameters within the model [72]. These methods allow for rapid adaptation to new tasks without experiencing catastrophic forgetting [145] and frequently demonstrate enhanced robustness in out-of-distribution evaluation [108]. Various approaches have been proposed for PEFT in recent years [104, 108, 80, 79]. Out of these approaches, LoRA [78] has been one of the most widely adopted. LoRA is designed with the Lottery Ticket Hypothesis [LTH; 55] in mind. According to the LTH, within densely connected, randomly initialized, feedforward networks, there exist smaller subnetworks that, when trained independently, can achieve performance comparable to the original network. LoRA operationalizes the LTH by approximating the model parameter updates with low-rank matrices inserted into every layer of the Transformer. While these methods enable more efficient adaptation, investigating whether PEFT methods retain the capabilities and behaviors of the full model, especially in regard to robustness to noisy labels, will provide insights into the trade-offs between efficiency and model reliability.

5.2 Experimental Setup
We compare the performance of fine-tuning and LoRA-tuning of pre-trained language models when applied to training data that contain various degrees of noisy labels. To create datasets with varying levels of label noise, we randomly change the label of a data point with different probabilities ranging from 10% to 40%.

Figure 5.2: Percentage of clean samples correctly classified by LoRA. LoRA demonstrates a consistent ability to learn almost exclusively from the clean samples.
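A minimal sketch of this label-flipping procedure is shown below; it assumes a list of binary-labeled examples and is meant only to illustrate how the noisy training sets are constructed.

# Sketch: injecting instance-independent (random) label noise into a binary
# classification training set; each label is flipped with probability `noise_rate`,
# independently of the text itself.
import random

def add_label_noise(examples, noise_rate, seed=0):
    # examples: list of {"text": ..., "label": 0 or 1}
    rng = random.Random(seed)
    noisy = []
    for ex in examples:
        label = ex["label"]
        if rng.random() < noise_rate:
            label = 1 - label                      # flip the binary label
        noisy.append({**ex, "label": label, "is_clean": label == ex["label"]})
    return noisy

# e.g., the 10%-40% noise conditions used in this chapter:
# noisy_sets = {p: add_label_noise(imdb_train, p) for p in (0.1, 0.2, 0.3, 0.4)}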
This process, where the label corruption is conditionally independent of the data, is known as instance-independent label noise [170]. We conducted our experiments on the IMDB dataset [115] and limited the training data size to 10,000 samples. For all experiments, we kept the evaluation and test sets fixed. We use RoBERTa-base [113] and train all models for 20 epochs with a learning rate of 1e-5 and a linear learning-rate schedule with a warmup ratio of 0.06. We used the AdamW optimizer [114] with an L2 regularization of 0.01. For LoRA, we used an α value of 16 and an r value of 8.

5.2.1 LoRA is Also Robust to Label Noise
First, we compare the train and validation performance of LoRA and fine-tuning on the fully balanced IMDB training dataset with various levels of label noise. Our goal in this analysis is to investigate whether LoRA exhibits similar patterns of robustness to full fine-tuning. As shown in Figure 5.1, similar to full fine-tuning, LoRA achieves high validation performance of above 90% regardless of the level of noise present. However, the two methods behave differently on the training data. Specifically, we observe that full fine-tuning overfits all training data (including the noisy samples), consistently reaching F1 scores above 95% on the noisy training set. In contrast, the training performance of LoRA plateaus. Furthermore, we observe that the maximum training performance of LoRA decreases from 93.8% to 55.3% as we increase the noise in the training dataset (see Table D.1 for detailed results). This low performance on the noisy training set, in addition to high validation performance, suggests that LoRA might only be learning to predict the clean samples correctly. To gain deeper insights into the underlying mechanisms leading to LoRA's robustness, we look into the accuracy of the model over both the noisy and clean sets as training progresses. Figure 5.2 shows what percentage of correctly classified samples are clean data points during training. We observe that as training progresses, over 90% of correctly classified data points come from the clean set. However, a stark contrast emerges when considering its performance on noisy samples. Despite the varying levels of noise, the model consistently resists fitting the noisy data, accurately classifying as few as 10% of the noisy samples (Figure 5.9).

Figure 5.3: Number of forgettable data points for LoRA (blue) and fine-tuning (orange). LoRA consistently forgets fewer data points on the validation set.

5.2.2 Learning and Forgetting in LoRA
The total number of forgettable data points reveals how models are impacted by noise over training and points to their resilience to noisy labels (i.e., a model that forgets fewer data points as a result of increased noise can potentially generalize better even after facing noisy examples). Here, we define forgettable data points for a model as those initially learned during training (i.e., correctly classified at some point), yet subsequently forgotten (i.e., misclassified later in training). Figure 5.3 shows the number of forgettable data points for LoRA and fine-tuning for various levels of noise. Notably, LoRA consistently exhibits a low number of forgettable data points on the validation set, indicating its robustness, whereas the number of forgettable data points increases for fine-tuning as the level of noise in the training data worsens.
Both models exhibit similar trends for forgettable data points on the noisy training data, with the count increasing as the noise level rises.

5.2.3 Robustness in the Face of Data Imbalance
Many real-world NLP applications lack balanced data distributions. For example, datasets for hate speech or offensive language detection often have a small fraction of positive samples [197]. To better understand the benefits of the observed robustness to label noise in practical settings, it is crucial to acknowledge the prevalence of imbalanced data. To assess this, we constructed various versions of the IMDB dataset, keeping the training size constant at 10,000 but varying the percentage of positive sentiment samples between 50%, 40%, 30%, 20%, 10%, and 5%. For each version of the imbalanced dataset, we added varying degrees of noise and conducted the robustness-to-noise experiments described in subsection 3.3.3.

Figure 5.4: The degradation in best validation performance begins at lower values of noise as the imbalance worsens.

As depicted in Figure 5.4, compared to validation performance with no noise, the validation performance drops more as the imbalance intensifies. For example, the performance degrades by 5.2% when 40% noise is added to the balanced dataset, whereas this degradation grows to 12% with the same noise when the dataset is balanced at 5%. This widening gap underscores the challenge posed by imbalanced data and emphasizes the importance of developing robust NLP models capable of handling such scenarios effectively. Furthermore, we observe that this performance gap begins to manifest even at lower levels of noise in the data distribution. This early emergence of performance discrepancies highlights the sensitivity of NLP models to imbalanced datasets, suggesting that even a modest degree of imbalance can significantly impact model generalization.

5.3 LoRA Almost Exclusively Learns from the Clean Data
Figure 5.9 illustrates the accuracy comparison between LoRA and fine-tuning on the noisy samples of the training set. A notable observation is the strikingly opposite patterns exhibited by the two approaches. LoRA consistently yields a low accuracy, typically less than 10%, on these noisy samples. Conversely, fine-tuning demonstrates the capability to adapt to noisy data irrespective of the noise level, achieving an accuracy of approximately 90% on both the noisy and clean subsets (Figure 5.5).

5.4 Deeper Dive into Learning and Forgetting
In addition to performance, we track when data points are correctly classified for the first time (learning event) and when a data point that was previously learned is misclassified by the model (forgetting event).

Figure 5.5: Percentage of clean samples correctly classified by fine-tuning.

Figure 5.10 presents a comparison of learning events in LoRA and fine-tuning. It is evident from the graph that in both approaches, the majority of learning events occur during the initial epoch, with LoRA consistently having fewer learning events than fine-tuning in these early stages. Yet, as shown in the figure, LoRA exhibits more learning events in later epochs compared to fine-tuning, especially in scenarios with higher noise levels. Figure 5.11 provides a comparison of forgetting events in LoRA and fine-tuning.
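The sketch below shows one way the learning events, forgetting events, and forgettable-example counts behind Figures 5.3, 5.10, and 5.11 can be derived, assuming per-epoch correctness of every example has been logged as a boolean matrix (an assumption about the logging format, not the actual analysis code).

# Sketch: deriving learning events, forgetting events, and forgettable examples
# from logged per-epoch correctness. `correct[e, i]` is True iff example i was
# classified correctly at the end of epoch e (an assumed logging format).
import numpy as np

def learning_and_forgetting(correct):
    correct = np.asarray(correct, dtype=bool)       # shape: (epochs, examples)
    epochs, n = correct.shape
    ever_learned = np.zeros(n, dtype=bool)
    learning_events = np.zeros(epochs, dtype=int)
    forgetting_events = np.zeros(epochs, dtype=int)

    for e in range(epochs):
        newly_learned = correct[e] & ~ever_learned
        learning_events[e] = newly_learned.sum()    # first time classified correctly
        if e > 0:
            forgotten = correct[e - 1] & ~correct[e]  # correct -> incorrect transition
            forgetting_events[e] = forgotten.sum()
        ever_learned |= correct[e]

    # Forgettable examples: correctly classified at some epoch, misclassified later.
    transitions = np.diff(correct.astype(int), axis=0)
    n_forgettable = int((transitions < 0).any(axis=0).sum())
    return learning_events, forgetting_events, n_forgettable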
We observe a clear distinction between the two approaches: fine-tuning shows more forgetting events than LoRA throughout training, especially at higher levels of noise.

5.5 Increasing Model Size
To examine the influence of model size on robustness, we additionally conduct the analysis outlined in subsection 3.3.3 using RoBERTa-large. Looking at Figure 5.6, we observe patterns of robustness to noise similar to RoBERTa-base; the only notable difference is that RoBERTa-large plateaus at earlier epochs. Figure 5.7 shows the accuracy of RoBERTa-large on both the clean and noisy training subsets for different levels of noise, and we note a pattern similar to RoBERTa-base. As shown in Figure 5.8, LoRA-tuning RoBERTa-large also exhibits a notable ability to fit clean samples while demonstrating resilience against overfitting noisy samples. However, we observe that the larger model learns the clean data (and unlearns noisy data) at earlier epochs compared to the base model.

Figure 5.6: Learning dynamics for LoRA applied to RoBERTa-large on a balanced subset of the IMDB dataset.

5.6 Discussion
Our study highlights the efficacy and resilience of PEFT, particularly LoRA, in learning from noisy labels. Through our comprehensive analysis, we have shown that LoRA tuning not only retains the robustness to label noise exhibited by fine-tuning but also demonstrates unique advantages. Specifically, LoRA shows resistance to overfitting noisy labels, an ability to learn almost exclusively from clean data, and lower forgetting rates compared to fine-tuning. Additionally, our experiments shed light on label noise robustness in imbalanced training data. We found that imbalanced data exacerbates the effects of noisy labels, particularly as the level of imbalance increases, even at lower noise levels. These findings highlight LoRA's potential in real-world scenarios where noisy data and class imbalances prevail, offering a promising balance between efficiency and robustness for adapting large-scale language models to downstream tasks. This analysis is limited to English. Hence, the conclusions drawn may not fully translate to other languages or linguistic contexts due to differences in syntax and semantics, among other factors. Consequently, the applicability of these findings in multilingual or cross-cultural settings warrants careful consideration and potentially necessitates additional research to ascertain their broader relevance. Additionally, we acknowledge that the IMDB dataset is not devoid of noisy labels. However, since this dataset has been widely adopted in machine learning research, the extent of noise can be assumed to be limited. I also acknowledge that our analysis is limited in the type of noise explored. Variations in the nature of noise, such as instance-dependent noise, could lead to disparate results not explored within the scope of this chapter. I believe that my analysis and experimental design serve as a solid foundation for future researchers to explore other noise structures, such as instance-dependent noise. In summary, while this study provides valuable insights within the confines of the chosen language models, methods, datasets, noise types, and linguistic context, it is essential to recognize the limitations inherent in these choices.
Future research endeavors should aim to address these limitations by exploring alternative approaches, diverse datasets, and broader linguistic contexts to enrich our understanding and enhance the generalizability of these findings.

Figure 5.7: Comparison of the accuracy on clean (left) and noisy (right) samples in the training set for LoRA applied to RoBERTa-large on the balanced IMDB dataset.

Figure 5.8: Comparison of learning (left) and forgetting (right) for LoRA applied to RoBERTa-large on the balanced IMDB dataset.

Figure 5.9: Comparison of the accuracy on noisy samples in the training set for LoRA (left) and fine-tuning (right) on the balanced IMDB dataset.

Figure 5.10: Comparison of learning events for LoRA (left) and fine-tuning (right) on the balanced IMDB dataset.

Figure 5.11: Comparison of forgetting events for LoRA (left) and fine-tuning (right) on the balanced IMDB dataset.

Chapter 6: Conclusions
In this thesis, we delved into the intricate challenge of addressing problematic content, including hate speech, offensive language, and negative stereotypes, in digital spaces. Central to this thesis is the emphasis on the crucial role of interdisciplinary collaboration between social and computer scientists, recognizing that problematic content often arises from deep-seated societal and psychological issues. Throughout this thesis, I advocate for holistic solutions that extend beyond mere technological interventions; by integrating insights from social psychology with computational methodologies, I proposed comprehensive strategies to mitigate the detrimental impact of such content.

In chapter 2, I exemplified the effectiveness of a theory-driven approach in mitigating stereotypical biases embedded within pretrained word embeddings and language models. By replacing the ad hoc, social-group-specific aspect of existing bias mitigation approaches with a universal, theory-driven, social-group-agnostic alternative, I addressed two major shortcomings of prior debiasing work and paved the way for further exploration of theory-based methodologies. Firstly, I showed that, unlike previous group-specific methodologies that rely on manually curated group-specific resources, my proposed theory-driven approach based on the SCM reduces bias across social attributes. Secondly, I demonstrated that, unlike previous group-specific approaches, SCM-based debiasing is scalable. With group-specific approaches, the quality and utility of word embeddings and language models degrade as the number of social groups for which biases are mitigated increases. However, SCM-based debiasing results in improvements in the quality of the respective word embeddings or language models. This chapter showcases how leveraging theories of stereotyping from social psychology offers a transformative approach to reducing biases in language models and fostering more inclusive outcomes.
In chapter 3, I demonstrated that overlooking the social underpinnings of problematic online content in previous methodologies has led to the development of static solutions that struggle to adapt to the evolving landscape of such content. In response, I have devised a framework and benchmark based on continual learning, which recognizes the dynamic nature of problematic content online. This framework consists of two key components: firstly, a sequence of upstream problematic tasks to assess a model’s ability for knowledge accumulation, and secondly, a separate set of few-shot downstream tasks to evaluate a model’s adaptability in understanding novel forms of problematic content. Our experiments clearly illustrate the effectiveness of this approach, particularly in enabling models trained with this paradigm to adapt to new 70 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation types of problematic content. Furthermore, the developed benchmark serves as a foundation for measuring the progress our community is making in tackling the challenge of detecting problematic content online. This chapter establishes the groundwork for the development of computational tools aimed at combating problematic content, which acknowledge the dynamic social roots of this issue. In chapter 4, I highlighted the importance of culture in shaping perceptions of offensiveness and demonstrate how cultural diversity enhances offensive language detection in multilingual settings. The chapter begins by taking a step towards the overarching goal of this thesis: to develop inclusive, adaptable, and robust models for detecting a broader spectrum of problematic content, particularly by extending the established benchmark to multilingual settings. However, it became evident that the proposed framework did not readily generalize to such settings. Following this observation, I set out to study the cross-lingual transferability of offensive language detection by conducting a systematic examination of the influence of both cultural and linguistic factors on cross-lingual transferability across 105 language pairs. Through this analysis, I showed that linguistic proximity does not significantly affect transferability, yet there’s a significant enhancement in transfer when utilizing culturally diverse language pairs. This chapter highlights the limitations of solely depending on English-centric approaches. It underscores the significance of cultural context in offensive language detection, paving the way for building culturally sensitive offensive language detection methods. Finally, in chapter 5, I studied the adverse effects of noisy labels found in problematic content datasets on the efficacy of parameter-efficient solutions for detecting such content. These problematic content datasets are typically gathered through crowd-sourcing efforts, rendering them vulnerable to label inaccuracies. Furthermore, as the number of parameters in language models continues to increase, PEFT methods such as LoRA, offer computational efficiency with minimal performance degradation. However, their resilience to label noise has not been thoroughly explored. This chapter investigates whether models tuned with LoRA exhibit resistance to noisy labels. Through a series of experiments, several significant findings emerge: Firstly, LoRA demonstrates robustness to random noise comparable to full fine-tuning on balanced data, yet it avoids overfitting to noisy data. 
Secondly, in contrast to full fine-tuning, LoRA retains significantly more data points as noise levels increase. Thirdly, when examining how these robustness patterns evolve with imbalanced training data, it becomes evident that Transformers struggle with imbalance, with their robustness diminishing as the data becomes more skewed. This chapter underscores the potential of LoRA in real-world scenarios for addressing problematic content characterized by noise and data imbalance.

Overall, this thesis represents a significant advancement in fostering safer and more inclusive digital spaces, but the journey to tackle problematic content online is far from over. By sustaining interdisciplinary collaboration and embracing ongoing advances in both the computational and social sciences, we can continue to push the boundaries of what can be achieved in mitigating the harmful effects of problematic content online.

Appendix A: Supplemental Material for Chapter 2

A.1 Implementation Details for Static Word Embeddings

A.1.1 Training Word Embeddings
We used the Gensim [153] implementation of the Skip-Gram with Negative Sampling variant of Word2Vec [125] to train a 300-dimensional word embedding model on WikiText-103 [123] with a 5-word window. Words with fewer than 5 occurrences in the corpus were dropped. Training was done for 5 iterations with 48 threads on a single AMD Ryzen Threadripper 2990WX CPU.

A.1.2 Debiasing Algorithms
We used Bolukbasi et al. [18]'s gender word sets and implementation for HD. For Sub, LP, and PP, we follow Dev and Phillips [46]. The implementation can be found in the project repository.

A.1.3 Bias Subspace
The bias subspace used in HD is identical to that of Bolukbasi et al. [18]. For Sub, LP, and PP, each dimension's bias subspace was the first principal component of the matrix C, constructed using 8 randomly sampled word pairs from the corresponding dimension's word pair list (Section A.1.5).

A.1.4 Experiments
Each debiasing algorithm for each dimension was run 30 times, each time using a random sample of 8 pairs from the corresponding word list.

A.1.5 Word Pairs

Social group word pairs:
Gender (from [18]): nephews, nieces - nephew, niece - males, females - boys, girls - man, woman - sons, daughters - brother, sister - boy, girl - father, mother - guy, gal - male, female - uncle, aunt - himself, herself - uncles, aunts - fathers, mothers - his, her - son, daughter - him, her - men, women - his, hers - he, she - brothers, sisters.
Race (from [24]): Brad, Darnell - Brendan, Hakim - Geoffrey, Jermaine - Greg, Kareem - Brett, Jamal - Neil, Rasheed - Neil, Rasheed - Todd, Tyrone - Allison, Aisha - Anne, Ebony - Carrie, Keisha - Emily, Kenya - Laurie, Latoya - Meredith, Tamika.
Age (from [24]): Tiffany, Ethel - Michelle, Bernice - Cindy, Gertrude - Kristy, Agnes - Brad, Cecil - Eric, Wilbert - Joey, Mortimer - Billy, Edgar.
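As a rough illustration of the bias subspace construction described in Section A.1.3, the sketch below samples 8 word pairs from a list such as the gender pairs above, centers each pair around its mean, and takes the first principal component of the stacked centered vectors (a Bolukbasi-style construction of C). The use of gensim and scikit-learn mirrors our tooling, but the helper name and the exact construction of C are illustrative assumptions rather than the verbatim implementation.

```python
import random
import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

def estimate_bias_subspace(kv: KeyedVectors, word_pairs, n_pairs=8, seed=0):
    """Sample n_pairs word pairs, center each pair around its mean, stack the
    centered vectors into a matrix C, and return its first principal component."""
    rng = random.Random(seed)
    rows = []
    for w1, w2 in rng.sample(list(word_pairs), n_pairs):
        if w1 in kv and w2 in kv:
            center = (kv[w1] + kv[w2]) / 2.0
            rows.extend([kv[w1] - center, kv[w2] - center])
    C = np.vstack(rows)
    return PCA(n_components=1).fit(C).components_[0]  # unit-norm bias direction

# Illustrative usage with a few of the gender pairs listed above (hypothetical embedding file):
# kv = KeyedVectors.load_word2vec_format("word2vec_wikitext103.bin", binary=True)
# gender_pairs = [("man", "woman"), ("he", "she"), ("his", "her"), ("boy", "girl"),
#                 ("father", "mother"), ("son", "daughter"), ("uncle", "aunt"), ("brother", "sister")]
# gender_direction = estimate_bias_subspace(kv, gender_pairs, n_pairs=8)
```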
SCM word pairs for static word embeddings:
Warmth (from [135]): pleasant, unpleasant - liked, disliked - outgoing, shy - sensitive, insensitive - friendliness, unfriendliness - sociable, unsociable - warm, cold - warmth, coldness - honest, dishonest - fair, unfair - loyal, disloyal - right, wrong - criminal, innocent - genuine, fake - reliable, unreliable.
Competence (from [135]): smart, stupid - competent, incompetent - intelligent, dumb - able, unable - rational, irrational - capable, incapable - aggressive, docile - resilient, nonresilient - motivated, unmotivated - ambitious, unambitious - independent, dependent - determined, inactive - secure, insecure - clever, foolish - dominant, submissive.

A.1.6 SCM-based Debiasing for More Social Groups
To (1) replicate our results with pre-trained embeddings and (2) include additional social groups, we present results for Word2Vec trained on Google News, including the additional groups of Asians, Hispanics, and fat vs. thin (Table A.1). The results show that our proposed framework succeeds in handling these additional dimensions and generalizes to other embeddings.

         Gender  Black  Asian  Hispanic  Age   Fat-Thin
PPsame   0.86    0.56   0.95   0.91      0.51  0.91
PPSCM    0.85    0.78   0.95   0.91      0.78  0.91

Table A.1: ECT for each group when the model is debiased along the same dimension (PPsame, i.e., debiasing on gender for gender) and using SCM (PPSCM).

A.2 Implementation Details for Debiasing Language Models

A.2.1 SCM Words for Language Models
We used the following words as SCM attributes in the DPCE and ADEPT algorithms to mitigate bias in the models:
Warm: social, warm, popular, good, right, kind, loyal, pleasant, friendly, funny, moral, fair, sympathetic, sensitive, cooperative, innocent, liked, responsible, genuine, polite, trustworthy, reliable, caring, helpful, thoughtful.
Cold: antisocial, cold, unpopular, bad, wrong, mean, treacherous, unpleasant, unfriendly, boring, immoral, unfair, unsympathetic, insensitive, selfish, criminal, disliked, irresponsible, fake, rude, untrustworthy, unreliable, uncaring, unhelpful, inconsiderate.
Competent: able, bright, brilliant, competent, capable, wise, rational, practical, dominant, dependent, confident, active, efficient, ambitious, determined, critical, secure, daring, educated, aggressive, motivated, intelligent, graceful, creative, energetic.
Incompetent: unable, stupid, dumb, incompetent, incapable, unwise, irrational, impractical, submissive, independent, insecure, inactive, inefficient, lazy, doubtful, naive, vulnerable, cautious, uneducated, docile, unmotivated, unintelligent, clumsy, unimaginative, lethargic.

A.2.2 Gender and Religion Words
For our experiments, we used the same word lists for the gender and religion dimensions as [88, 196].

A.2.3 DPCE Sample Size
After collecting sentences from corpora for the three different settings (gender, religion, and SCM), we separated 1000 samples for evaluation and trained the model with |S| ∈ {100, 500, 1000, 5000}. We ran the StereoSet benchmark for each |S| and chose the best model based on the overall ICAT score (Table A.2).

               |S|=100                |S|=500                |S|=1000               |S|=5000
               Gender Religion SCM    Gender Religion SCM    Gender Religion SCM    Gender Religion SCM
Gender
  LMS ↑        47     75.5     84.4   84.7   54.7     83.3   81.7   42.7     81.1   77     58.7     35.6
  SS → 50      52.5   57.9     62.3   59.9   48.4     58.7   57.6   49.9     59.9   57.3   49       50.9
  ICAT ↑       44.7   63.5     63.6   68     53       68.8   69.4   42.7     65.1   65.7   57.5     35
Profession
  LMS ↑        49.6   76.7     80.8   82.2   57.7     82.6   72.7   47.3     79.6   56.4   57.8     36.9
  SS → 50      52.5   55.7     58.1   57.3   50.4     55.5   54.8   50.6     53.9   50.7   45.9     50.4
  ICAT ↑       47.2   67.9     67.7   70.3   57.2     73.5   65.8   46.7     73.3   55.5   53.1     36.6
Race
  LMS ↑        47.1   81       80.9   82.9   74.8     83.5   72.3   46.2     82.8   57.4   57.2     34.8
  SS → 50      50.2   55.8     59.1   56.3   52.1     55.9   52     46.6     56.6   59.8   54.2     43
  ICAT ↑       47     71.7     66.1   72.6   71.7     73.7   69.4   43.1     71.8   46.1   52.4     29.9
Religion
  LMS ↑        41.6   79.9     79.7   80.1   80.6     83.5   72.2   73.7     81.6   54.1   73.7     27.8
  SS → 50      47.7   58.8     60.5   57.8   57.6     57.9   50.1   55.5     55.4   60     58.4     46.8
  ICAT ↑       39.7   65.8     63     67.7   68.3     70.3   72     65.5     72.7   43.3   61.4     26.1
Overall
  LMS ↑        47.8   78.7     81.3   82.8   66.2     83.1   73.6   47.2     81.3   59.3   58.2     35.4
  SS → 50      51.2   56.1     59.2   57.1   51.2     56.1   53.6   48.8     56     56.2   50.6     46.9
  ICAT ↑       46.7   69       66.3   71     64.5     72.9   68.3   46.1     71.6   52     57.5     33.2

Table A.2: StereoSet benchmark for DPCE with different sample sizes.

A.2.4 ADEPT + SCM
To debias with ADEPT using SCM, we adopted a coordinate descent-based algorithm. At each epoch, we first neutralized warmth words with respect to the neutral words, then neutralized competence words with respect to those neutral words. These results are reported in chapter 2.
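The alternating schedule described above can be summarized with the sketch below. The `neutralize` helper is a hypothetical placeholder for a single ADEPT-style debiasing pass over the given attribute words (defined by the ADEPT loss in the actual implementation); this is a structural sketch of the coordinate-descent schedule, not the algorithm itself.

```python
# Structural sketch of the coordinate descent-based schedule; `neutralize` is a
# hypothetical placeholder for one ADEPT-style debiasing pass and is NOT the
# actual ADEPT implementation.

def neutralize(model, attribute_words, anchor_words):
    """Placeholder: run one debiasing pass so that `attribute_words` are
    neutralized with respect to `anchor_words` (e.g., one epoch of an ADEPT-style loss)."""
    raise NotImplementedError

def scm_coordinate_descent(model, warmth_words, competence_words, neutral_words, n_epochs):
    for _ in range(n_epochs):
        # First neutralize warmth words with respect to the neutral words ...
        neutralize(model, warmth_words, neutral_words)
        # ... then neutralize competence words with respect to the same neutral words.
        neutralize(model, competence_words, neutral_words)
    return model
```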
We also experimented with this coordinate descent-based algorithm without using any explicit neutral words. Instead, at each epoch, we neutralized warmth words with respect to competence words, then neutralized competence words with respect to warmth words. In the perfect case of orthogonal warmth and competence axes, this debiasing procedure would hopefully retain all other attributes of words. In our experiments, we found no discernible difference when comparing this to the version with a separate set of neutral words, so we do not report these results. Figure A.1 shows that many of the warmth and competence words indeed do not appear as close as they could be with respect to all the other words used. We speculate that with more effective adaptations of the ADEPT algorithm, SCM-based debiasing with this algorithm might be able to achieve better results on various bias benchmarks.

Figure A.1: t-SNE plot after running ADEPT based on SCM. Some pairwise words, mostly in the warmth dimension (red/yellow), cluster together, but others, mostly in the competence dimension (green/cyan), do not.

Appendix B: Supplemental Material for Chapter 3

B.1 Supplementary Material

B.1.1 Hardware and Runtimes
Experiments were conducted on Nvidia Quadro 6000 GPUs with CUDA version 11.4. Each upstream training run over the 26 tasks takes around 12 hours, and few-shot training and evaluation on all 58 downstream tasks for a single model takes around 6 hours to complete.

B.1.2 Data Sources, Statistics, and License Information
All of the datasets used in this benchmark are publicly available for research purposes. Table B.4 provides license information for all datasets. We do not redistribute these datasets. In our Github repository (https://github.com/........), we offer a clear guide on how to create a local copy of all the datasets used in our benchmark from the original sources. Our benchmark consists of English classification datasets that contain tasks related to problematic content detection. Each label from each dataset is treated as a separate task, and we only used tasks with more than 100 positive examples in their training sets.
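As an illustrative sketch of this construction, assuming each dataset is loaded as a pandas DataFrame with a text column and one binary column per label (the column, variable, and function names here are ours, not the benchmark's):

```python
import pandas as pd

def make_binary_tasks(train_df: pd.DataFrame, label_columns, min_positives: int = 100):
    """Treat each label column as a separate binary task and keep only tasks
    with more than `min_positives` positive training examples."""
    tasks = {}
    for label in label_columns:
        n_pos = int(train_df[label].sum())
        if n_pos > min_positives:
            # Each task pairs the text with a single binary label.
            tasks[label] = train_df[["text", label]].rename(columns={label: "label"})
    return tasks

# Example usage (hypothetical file and label names):
# df = pd.read_csv("ucc_train.csv")
# tasks = make_binary_tasks(df, ["antagonize", "condescending", "dismissive"])
```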
Tables B.1 and B.2 show dataset statistics, along with the number of positive samples per task, for upstream and downstream tasks, respectively. Table B.3 shows the number of datasets from each source.

B.1.3 Model Implementation Details
For all experiments, we used a batch size of 32 and trained the models for at most 100 epochs. To prevent the model from overfitting, we used early stopping with a patience of three and chose the best model based on the F1 score. Due to the paucity of problematic content online, most of the datasets in this benchmark are highly sparse. This sparsity poses challenges to the optimization process. To address this, we used a weighted random sampler ensuring that each batch consists of at least 30% positive samples.

Abusive: abusive (2763); hateful (503); total (8597)
CAD: affiliation directed abuse (242); identity directed abuse (514); person directed abuse (237); total (5307)
Dygen: hate (2268); total (4120)
GHC: human degradation (491); vulgarity (369); total (5510)
Gate: hateful (170); offensive (1247); total (10207)
Civil comments: identity attack (687); insult (5776); obscene (543); threat (221); toxicity (7777); total (97320)
Personal attack: attack (3056); recipient attack (1999); third party attack (204); total (23178)
UCC: antagonize (203); condescending (269); dismissive (150); generalisation (96); generalisation unfair (91); healthy (320); hostile (108); sarcastic (201); total (4425)

Table B.1: Number of label occurrences in upstream tasks' test sets.

Adapter: To implement Adapter models, we added an adapter [75] between each layer of the BART transformer. The adapter consists of an autoencoder with input and output layers whose size equals the embedding dimension and a hidden layer of size 256 in the middle.

BiHNet: BiHNet is an extension of hypernetworks. It computes two different losses using two forms of task representation, a long task representation and a short task representation, to generate weights for the classification model. In our experiments, we calculated the long task representation by averaging the embeddings of all text samples in the training split of a dataset. The short task embeddings, which are designed to help the model in few-shot settings, were computed by averaging the embeddings of 64 texts sampled from the training set. For both long and short task representations, we used sentence-transformers [154] (paraphrase-xlm-r-multilingual-v1, https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1) with mean pooling. The final model weights are calculated as the sum of the weights generated using the long and short task representations. Following [84], we used a two-layer MLP with a hidden size of 32 as the weight-generating hypernetwork for each classification model. When BiHNet was used in a model variation that utilizes adapters, we used it to generate only the weights of all adapters in addition to each classification head.

Multitask Learning: In the multi-task setting, we used hard parameter sharing. For Adapter-Multitask models, we shared only the adapter parameters, and for BiHNet-Multitask models, we used a shared BiHNet for all tasks. We used the BiHNet to generate task-specific parameters using the long and short task-specific representations.

Continual Learning Parameters: For BiHNet-Reg and BiHNet-EWC, both of which are regularization-based approaches [91], we used a regularization coefficient of 0.01.
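A minimal sketch of the weighted sampling described above is shown below, using PyTorch's WeightedRandomSampler; the 30% positive share holds in expectation per batch rather than as a hard guarantee, and the variable names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_weighted_loader(features: torch.Tensor, labels: torch.Tensor,
                         batch_size: int = 32, positive_share: float = 0.3) -> DataLoader:
    """Oversample positive examples so that roughly `positive_share` of each batch is positive."""
    n_pos = int(labels.sum().item())
    n_neg = len(labels) - n_pos
    weights = torch.empty(len(labels), dtype=torch.float)
    weights[labels == 1] = positive_share / max(n_pos, 1)          # total positive mass ~ 0.3
    weights[labels == 0] = (1.0 - positive_share) / max(n_neg, 1)  # total negative mass ~ 0.7
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(TensorDataset(features, labels), batch_size=batch_size, sampler=sampler)
```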
Downstream Adaptation: For downstream adaptation, we conducted few-shot training for 800 epochs with a batch size of 8 for 8-shot experiments. For 16-shot and 32-shot experiments, we used a batch size of 16. Since the total number of training samples is less than 64 in our downstream few-shot adaptations, we only use the long task representation for BiHNet models. For Adapter-Multitask, we initialize a new classification head for each downstream task. However, for the Adapter-Vanilla model, we keep the existing classification head.

Dygen: Black men (7); African (8); Muslim women (10); Asylum seekers (13); Asians (15); Indigenous people (18); Gender minorities (21); Chinese people (25); Foreigners (26); Black women (27); Travellers (27); Nonwhites (28); Mixed race (30); Gay women (31); East Asians (32); South Asians (32); Gay men (34); support (35); Arabs (45); threatening (48); Refugees (51); dehumanization (70); People with disabilities (79); Gay people (81); Immigrants (81); Trans people (90); Jewish people (111); Muslims (126); Black (211); animosity (315); derogation (1036); total (3009)
CONAN: disabled (22); jews (59); muslims (134); migrant (96); woman (67); LGBT (62); people of color (35); total (501)
Hatecheck: trans (42); black (44); immigrants (45); muslims (47); gay (48); disabled (50); women (60); hate (117); total (373)
single adversarial toxic (300); total (3000)
multi BAD2 toxic (44); total (190)
BAD4 Stormfront hate (239); total (478)
US-election: hateful (31); total (300)
GHC: calls for violence (24); total (5510)
CAD: counter-speech (66); total (5307)
Misogyny: misogynistic (73); total (657)
CMSB: sexist (181); total (2363)

Table B.2: Number of label occurrences in downstream tasks' test sets.

Source: Twitter (6); Reddit (2); Wikipedia (2); Gab (1); Stormfront (1); Chat dialogue (4); Synthetically generated (2); Civil Comments (1).

Table B.3: Number of datasets by source.

Table B.4: License information for all datasets used in the benchmark. According to this information, all datasets can be used for research purposes.
Name - License - Source
UCC and Ex Machina - CC-BY-SA - https://en.wikipedia.org/wiki/Wikipedia:Copyrights#Contributors'_rights_and_obligations
Civil Comments Corpus - CC0 - https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data
Misogyny Detection - MIT - https://github.com/ellamguest/online-misogyny-eacl2021
CAD - CC-BY Attribution 4.0 International - https://zenodo.org/record/4881008
DYGEN - CC BY 4.0 - Footnote on the first page of the paper: https://dl.acm.org/doi/pdf/10.1145/3580305.3599318
HateCheck - CC BY 4.0 - https://github.com/paul-rottger/hatecheck-data/blob/main/LICENSE
CONAN - "resources can be used for research purposes" - https://github.com/marcoguerini/CONAN
Stormfront - CC-BY-SA - https://github.com/Vicomtech/hate-speech-dataset
GHC - CC-BY Attribution 4.0 International - The GHC is available on the Open Science Framework (OSF, https://osf.io/edua3/), and the license is discussed in detail in section 4 of the paper
CMSB - CC BY-NC-SA 4.0 - https://data.gesis.org/sharing/#!Detail/10.7802/2251
Large-Scale Hate Speech Detection with Cross-Domain Transfer - CC-BY-SA 4.0 - https://github.com/avaapm/hatespeech/blob/master/LICENSE
US Election - data is publicly available - https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/stance-hof/
Dialogue Safety - MIT - https://github.com/facebookresearch/ParlAI/blob/main/LICENSE
Twitter Abusive - CC-BY Attribution 4.0 International - https://zenodo.org/record/2657374
B.1.4 Detailed Results

                   Adapter-Multitask  Adapter-Vanilla  BiHNet-Multitask  BiHNet-EWC  BiHNet-Reg
Adapter-Vanilla    0.015              -                -                 -           -
BiHNet-Multitask   0.334              0.144            -                 -           -
BiHNet-Reg         0.018*             0.955            0.159             -           -
BiHNet-EWC         0.916              0.012*           0.284             0.014*      -
BiHNet-Vanilla     0.037*             0.738            0.259             0.781       0.028*

Table B.5: P-values of pairwise t-tests between the few-shot performances for experiments with the chronological order of upstream tasks.

Below we provide detailed results, including AUC and F1 scores, for all upstream and downstream tasks in our experiments. Specifically, Tables B.6 and B.7 show detailed results for upstream training on experiments with chronological and random upstream task order, respectively. Tables B.8 and B.9 provide detailed results on all downstream tasks for chronological and random upstream task order, respectively. Table B.5 shows the p-values for pairwise t-tests conducted on the few-shot AUC of the various models. Our results indicate a significant difference between Adapter-Vanilla and BiHNet-Reg in downstream adaptation (i.e., few-shot). Furthermore, there is no significant difference between BiHNet-Reg and the Multitask models, which are considered the upper bounds. However, BiHNet-Reg significantly outperforms classic continual learning approaches such as EWC. These findings underscore the importance of developing continual learning approaches that emphasize generalization as solutions to practical scenarios for dealing with the ever-evolving nature of problematic content.
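For reference, a pairwise comparison of this kind can be computed with SciPy's paired t-test, as in the minimal sketch below; the scores shown are placeholders, not the values underlying Table B.5.

```python
from scipy import stats

# Hypothetical per-task few-shot AUC scores for two models, aligned task-by-task.
auc_bihnet_reg = [0.63, 0.85, 0.97, 0.93, 0.89, 0.79]
auc_adapter_vanilla = [0.63, 0.80, 0.90, 0.94, 0.83, 0.75]

# Paired t-test across downstream tasks (each task contributes one pair of scores).
t_stat, p_value = stats.ttest_rel(auc_bihnet_reg, auc_adapter_vanilla)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```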
81 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation order dataset task model final-f1 instant-f1 final-auc instant-auc 0 personal-attack a Adapter-Vanilla 0.305 0.751 0.541 0.963 0 personal-attack a BiHNet+Vanilla 0.265 0.750 0.728 0.957 0 personal-attack a BiHNet+Reg 0.743 0.752 0.957 0.960 0 personal-attack a BiHNet+EWC 0.296 0.738 0.897 0.955 - personal-attack a Adapter-Multitask 0.704 - 0.958 - - personal-attack a BiHNet-Multitask 0.747 - 0.955 - 1 personal-attack tpa Adapter-Vanilla 0.063 0.321 0.461 0.948 1 personal-attack tpa BiHNet+Vanilla 0.061 0.296 0.639 0.938 1 personal-attack tpa BiHNet+Reg 0.094 0.224 0.884 0.929 1 personal-attack tpa BiHNet+EWC 0.051 0.276 0.826 0.924 - personal-attack tpa Adapter-Multitask 0.311 - 0.941 - - personal-attack tpa BiHNet-Multitask 0.101 - 0.895 - 2 personal-attack ra Adapter-Vanilla 0.360 0.722 0.602 0.969 2 personal-attack ra BiHNet+Vanilla 0.340 0.730 0.786 0.970 2 personal-attack ra BiHNet+Reg 0.684 0.713 0.965 0.968 2 personal-attack ra BiHNet+EWC 0.386 0.733 0.925 0.969 - personal-attack ra Adapter-Multitask 0.679 - 0.971 - - personal-attack ra BiHNet-Multitask 0.682 - 0.959 - 3 jigsaw threat Adapter-Vanilla 0.105263 0.099762 0.863857 0.987086 3 jigsaw threat BiHNet+Vanilla 0.084746 0.119318 0.839035 0.983698 3 jigsaw threat BiHNet+Reg 0.013133 0.130612 0.747348 0.983460 3 jigsaw threat BiHNet+EWC 0.037736 0.086580 0.741358 0.986048 - jigsaw threat BiHNet-Multitask 0.031847 - 0.944563 - - jigsaw threat Adapter-Multitask 0.067901 - 0.981188 - 4 jigsaw insult Adapter-Vanilla 0.131 0.561 0.486 0.948 4 jigsaw insult BiHNet+Vanilla 0.080 0.548 0.595 0.944 4 jigsaw insult BiHNet+Reg 0.423 0.556 0.888 0.944 Continued on next page 82 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 4 jigsaw insult BiHNet+EWC 0.026 0.586 0.647 0.945 - jigsaw insult BiHNet-Multitask 0.484 - 0.926 - - jigsaw insult Adapter-Multitask 0.496 - 0.947 - 5 jigsaw toxicity Adapter-Vanilla 0.146 0.569 0.497 0.938 5 jigsaw toxicity BiHNet+Vanilla 0.088 0.575 0.616 0.931 5 jigsaw toxicity BiHNet+Reg 0.434 0.577 0.875 0.934 5 jigsaw toxicity BiHNet+EWC 0.025 0.545 0.653 0.934 - jigsaw toxicity BiHNet-Multitask 0.553 - 0.924 - - jigsaw toxicity Adapter-Multitask 0.495 - 0.935 - 6 jigsaw identity-attack Adapter-Vanilla 0.053 0.192 0.543 0.983 6 jigsaw identity-attack BiHNet+Vanilla 0.040 0.173 0.623 0.981 6 jigsaw identity-attack BiHNet+Reg 0.042 0.142 0.823 0.974 6 jigsaw identity-attack BiHNet+EWC 0.046 0.161 0.611 0.983 - jigsaw identity-attack BiHNet-Multitask 0.073 - 0.918 - - jigsaw identity-attack Adapter-Multitask 0.166 - 0.972 - 7 jigsaw obscene Adapter-Vanilla 0.046 0.200 0.423 0.973 7 jigsaw obscene BiHNet+Vanilla 0 0.289 0.676 0.979 7 jigsaw obscene BiHNet+Reg 0.051 0.157 0.901 0.969 7 jigsaw obscene BiHNet+EWC 0 0.172 0.651 0.977 - jigsaw obscene BiHNet-Multitask 0.066 - 0.950 - - jigsaw obscene Adapter-Multitask 0.114 - 0.962 - 8 abusive abusive Adapter-Vanilla 0.045 0.906 0.165 0.977 8 abusive abusive BiHNet+Vanilla 0.042 0.905 0.513 0.975 8 abusive abusive BiHNet+Reg 0.782 0.906 0.912 0.974 8 abusive abusive BiHNet+EWC 0.032 0.901 0.686 0.975 - abusive abusive BiHNet-Multitask 0.872 - 0.930 - - abusive abusive Adapter-Multitask 0.901 - 0.973 - Continued on next page 83 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation 
Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 9 abusive hateful Adapter-Vanilla 0.075 0.393 0.477 0.863 9 abusive hateful BiHNet+Vanilla 0.068 0.434 0.591 0.862 9 abusive hateful BiHNet+Reg 0.207 0.392 0.772 0.858 9 abusive hateful BiHNet+EWC 0.080 0.419 0.725 0.860 - abusive hateful BiHNet-Multitask 0.188 - 0.779 - - abusive hateful Adapter-Multitask 0.430 - 0.833 - 10 ghc hd Adapter-Vanilla 0.183 0.422 0.523 0.871 10 ghc hd BiHNet+Vanilla 0.139 0.438 0.608 0.860 10 ghc hd BiHNet+Reg 0.371 0.390 0.840 0.864 10 ghc hd BiHNet+EWC 0.063 0.413 0.701 0.865 - ghc hd Adapter-Multitask 0.423 - 0.863 - - ghc hd BiHNet-Multitask 0.380 - 0.837 - 11 ghc vo Adapter-Vanilla 0.224 0.491 0.542 0.904 11 ghc vo BiHNet+Vanilla 0.169 0.501 0.676 0.906 11 ghc vo BiHNet+Reg 0.325 0.497 0.851 0.907 11 ghc vo BiHNet+EWC 0.089 0.504 0.737 0.899 - ghc vo Adapter-Multitask 0.461 - 0.892 - - ghc vo BiHNet-Multitask 0.395 - 0.863 - 12 ucc hostile Adapter-Vanilla 0.167 0.210 0.566 0.848 12 ucc hostile BiHNet+Vanilla 0.058 0.218 0.583 0.812 12 ucc hostile BiHNet+Reg 0.103 0.205 0.722 0.852 12 ucc hostile BiHNet+EWC 0.018 0.202 0.615 0.833 - ucc hostile Adapter-Multitask 0.189 - 0.819 - - ucc hostile BiHNet-Multitask 0.139 - 0.773 - 13 ucc generalisation Adapter-Vanilla 0.083 0.156 0.449 0.826 13 ucc generalisation BiHNet+Vanilla 0.079 0.199 0.543 0.853 13 ucc generalisation BiHNet+Reg 0.140 0.183 0.840 0.868 Continued on next page 84 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 13 ucc generalisation BiHNet+EWC 0.040 0.167 0.641 0.836 - ucc generalisation Adapter-Multitask 0.184 - 0.848 - - ucc generalisation BiHNet-Multitask 0.105 - 0.768 - 14 ucc dismissive Adapter-Vanilla 0.100 0.194 0.601 0.789 14 ucc dismissive BiHNet+Vanilla 0.033 0.208 0.565 0.791 14 ucc dismissive BiHNet+Reg 0.104 0.231 0.643 0.808 14 ucc dismissive BiHNet+EWC 0.013 0.225 0.595 0.804 - ucc dismissive Adapter-Multitask 0.241 - 0.798 - - ucc dismissive BiHNet-Multitask 0.162 - 0.741 - 15 ucc antagonize Adapter-Vanilla 0.095 0.226 0.553 0.826 15 ucc antagonize BiHNet+Vanilla 0.019 0.254 0.571 0.826 15 ucc antagonize BiHNet+Reg 0.155 0.244 0.712 0.831 15 ucc antagonize BiHNet+EWC 0 0.245 0.608 0.832 - ucc antagonize Adapter-Multitask 0.239 - 0.790 - - ucc antagonize BiHNet-Multitask 0.182 - 0.744 - 16 ucc condescending Adapter-Vanilla 0.068 0.241 0.538 0.775 16 ucc condescending BiHNet+Vanilla 0.022 0.250 0.495 0.776 16 ucc condescending BiHNet+Reg 0.138 0.252 0.631 0.786 16 ucc condescending BiHNet+EWC 0.008 0.247 0.539 0.759 - ucc condescending Adapter-Multitask 0.248 - 0.759 - - ucc condescending BiHNet-Multitask 0.175 - 0.701 - 17 ucc sarcastic Adapter-Vanilla 0.040 0.147 0.524 0.697 17 ucc sarcastic BiHNet+Vanilla 0.017 0.154 0.521 0.694 17 ucc sarcastic BiHNet+Reg 0.102 0.173 0.579 0.708 17 ucc sarcastic BiHNet+EWC 0.010 0.165 0.490 0.713 - ucc sarcastic Adapter-Multitask 0.074 - 0.664 - - ucc sarcastic BiHNet-Multitask 0.113 - 0.630 - Continued on next page 85 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 18 ucc healthy Adapter-Vanilla 0.071 0.247 0.538 0.727 18 ucc healthy BiHNet+Vanilla 0.027 0.238 0.568 0.715 18 ucc healthy BiHNet+Reg 0.212 0.250 0.666 0.716 18 ucc healthy BiHNet+EWC 0.006 0.256 
0.575 0.730 - ucc healthy BiHNet-Multitask 0.180 - 0.692 - - ucc healthy Adapter-Multitask 0.194 - 0.702 - 19 ucc generalisation Adapter-Vanilla 0.078 0.231 0.453 0.836 19 ucc generalisation BiHNet+Vanilla 0.075 0.216 0.544 0.820 19 ucc generalisation BiHNet+Reg 0.153 0.240 0.836 0.844 19 ucc generalisation BiHNet+EWC 0.037 0.236 0.643 0.845 - ucc generalisation BiHNet-Multitask 0.118 - 0.763 - - ucc generalisation Adapter-Multitask 0.228 - 0.832 - 20 dygen hate Adapter-Vanilla 0.162 0.778 0.556 0.830 20 dygen hate BiHNet+Vanilla 0.107 0.771 0.536 0.807 20 dygen hate BiHNet+Reg 0.619 0.737 0.667 0.762 20 dygen hate BiHNet+EWC 0.058 0.774 0.521 0.820 - dygen hate Adapter-Multitask 0.732 - 0.810 - - dygen hate BiHNet-Multitask 0.713 - 0.759 - 21 cad personal Adapter-Vanilla 0.170 0.412 0.482 0.868 21 cad personal BiHNet+Vanilla 0.120 0.422 0.574 0.871 21 cad personal BiHNet+Reg 0.084 0.408 0.613 0.881 21 cad personal BiHNet+EWC 0.114 0.413 0.660 0.883 - cad personal Adapter-Multitask 0.435 - 0.893 - - cad personal BiHNet-Multitask 0.274 - 0.812 - 22 cad identity Adapter-Vanilla 0.127 0.400 0.532 0.808 22 cad identity BiHNet+Vanilla 0.098 0.402 0.576 0.794 22 cad identity BiHNet+Reg 0.147 0.380 0.567 0.796 Continued on next page 86 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 22 cad identity BiHNet+EWC 0.085 0.424 0.600 0.802 - cad identity Adapter-Multitask 0.363 - 0.771 - - cad identity BiHNet-Multitask 0.264 - 0.730 - 23 cad affiliation Adapter-Vanilla 0.069 0.434 0.380 0.880 23 cad affiliation BiHNet+Vanilla 0.099 0.457 0.524 0.875 23 cad affiliation BiHNet+Reg 0.423 0.441 0.846 0.883 23 cad affiliation BiHNet+EWC 0.073 0.446 0.563 0.860 - cad affiliation Adapter-Multitask 0.402 - 0.853 - - cad affiliation BiHNet-Multitask 0.354 - 0.808 - 24 hate offensive Adapter-Vanilla 0.094 0.805 0.392 0.977 24 hate offensive BiHNet+Vanilla 0.065 0.803 0.644 0.976 24 hate offensive BiHNet+Reg 0.143 0.815 0.898 0.979 24 hate offensive BiHNet+EWC 0.041 0.806 0.819 0.978 - hate offensive Adapter-Multitask 0.792 - 0.980 - - hate offensive BiHNet-Multitask 0.785 - 0.967 - 25 hate hateful Adapter-Vanilla 0.327 0.347 0.771 0.927 25 hate hateful BiHNet+Vanilla 0.368 0.378 0.920 0.946 25 hate hateful BiHNet+Reg 0.373 0.395 0.945 0.947 25 hate hateful BiHNet+EWC 0.292 0.345 0.916 0.938 - hate hateful BiHNet-Multitask 0.144 - 0.913 - - hate hateful Adapter-Multitask 0.382 - 0.945 - Table B.6: Final and instant AUC and F1 scores for upstream tasks for the chronological experiment order dataset task model final-f1 instant-f1 final-auc instant-auc 1 jigsaw obscene Adapter-Vanilla 0.020779 0.199005 0.634348 0.977025 1 jigsaw obscene BiHNet+Vanilla 0.026471 0.194175 0.726092 0.979034 1 jigsaw obscene BiHNet+Reg 0.117117 0.208877 0.946478 0.978208 Continued on next page 87 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 1 jigsaw obscene BiHNet+EWC 0.035088 0.298387 0.649704 0.976722 - jigsaw obscene Adapter-Multitask 0.202667 - 0.970656 - - jigsaw obscene BiHNet-Multitask 0.092511 - 0.944722 - 2 ucc generalisation Adapter-Vanilla 0.123967 0.256198 0.658976 0.860923 2 ucc generalisation BiHNet+Vanilla 0.107817 0.222222 0.706271 0.853472 2 ucc generalisation BiHNet+Reg 0.083832 0.206061 0.682753 0.860750 2 ucc generalisation 
BiHNet+EWC 0.105263 0.222841 0.653317 0.871959 - ucc generalisation Adapter-Multitask 0.185714 - 0.838597 - - ucc generalisation BiHNet-Multitask 0.113861 - 0.707083 - 3 hate hateful Adapter-Vanilla 0.100000 0.396985 0.688817 0.940574 3 hate hateful BiHNet+Vanilla 0.080491 0.396450 0.693829 0.939336 3 hate hateful BiHNet+Reg 0.119177 0.334096 0.774023 0.940949 3 hate hateful BiHNet+EWC 0.071477 0.389423 0.544535 0.944195 - hate hateful Adapter-Multitask 0.407692 - 0.960242 - - hate hateful BiHNet-Multitask 0.152436 - 0.914408 - 4 dygen hate Adapter-Vanilla 0.586525 0.772302 0.734833 0.828820 4 dygen hate BiHNet+Vanilla 0.637133 0.782263 0.706050 0.837907 4 dygen hate BiHNet+Reg 0.606033 0.748860 0.613006 0.762699 4 dygen hate BiHNet+EWC 0.547778 0.790928 0.706884 0.850217 - dygen hate Adapter-Multitask 0.750575 - 0.819942 - - dygen hate BiHNet-Multitask 0.713164 - 0.760064 - 5 ucc healthy Adapter-Vanilla 0.089796 0.252822 0.607956 0.723211 5 ucc healthy BiHNet+Vanilla 0.130506 0.245672 0.607529 0.717350 5 ucc healthy BiHNet+Reg 0.200000 0.280778 0.680537 0.720583 5 ucc healthy BiHNet+EWC 0.124567 0.239151 0.602608 0.709075 - ucc healthy Adapter-Multitask 0.224204 - 0.690258 - - ucc healthy BiHNet-Multitask 0.207002 - 0.690280 - Continued on next page 88 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 6 jigsaw threat Adapter-Vanilla 0.011019 0.123077 0.590772 0.987871 6 jigsaw threat BiHNet+Vanilla 0.006211 0.109375 0.693627 0.985106 6 jigsaw threat BiHNet+Reg 0.012539 0.095455 0.823852 0.989349 6 jigsaw threat BiHNet+EWC 0.008119 0.107969 0.606869 0.989180 - jigsaw threat Adapter-Multitask 0.094808 - 0.980725 - - jigsaw threat BiHNet-Multitask 0.047511 - 0.947328 - 7 ucc condescending Adapter-Vanilla 0.056122 0.246080 0.569447 0.785604 7 ucc condescending BiHNet+Vanilla 0.084130 0.243767 0.570273 0.783299 7 ucc condescending BiHNet+Reg 0.162839 0.232461 0.646058 0.776889 7 ucc condescending BiHNet+EWC 0.098160 0.238443 0.587424 0.787313 - ucc condescending Adapter-Multitask 0.207407 - 0.746329 - - ucc condescending BiHNet-Multitask 0.169611 - 0.703610 - 8 ucc hostile Adapter-Vanilla 0.079051 0.210169 0.601122 0.837135 8 ucc hostile BiHNet+Vanilla 0.070652 0.193853 0.594370 0.813944 8 ucc hostile BiHNet+Reg 0.190476 0.213992 0.789258 0.855591 8 ucc hostile BiHNet+EWC 0.105572 0.206522 0.602163 0.831534 - ucc hostile Adapter-Multitask 0.213198 - 0.828848 - - ucc hostile BiHNet-Multitask 0.150235 - 0.803156 - 9 ucc antagonize Adapter-Vanilla 0.085366 0.260870 0.627160 0.824417 9 ucc antagonize BiHNet+Vanilla 0.101545 0.239726 0.620268 0.823707 9 ucc antagonize BiHNet+Reg 0.200000 0.244275 0.760923 0.830485 9 ucc antagonize BiHNet+EWC 0.095465 0.259819 0.577579 0.803287 - ucc antagonize Adapter-Multitask 0.201780 - 0.790624 - - ucc antagonize BiHNet-Multitask 0.187373 - 0.786051 - 10 jigsaw identity-attack Adapter-Vanilla 0.100503 0.213043 0.841030 0.979880 10 jigsaw identity-attack BiHNet+Vanilla 0.082739 0.241470 0.877627 0.982284 10 jigsaw identity-attack BiHNet+Reg 0.033691 0.223350 0.805231 0.982487 Continued on next page 89 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 10 jigsaw identity-attack BiHNet+EWC 0.040332 0.232295 0.761092 0.981215 - jigsaw identity-attack Adapter-Multitask 0.145833 - 
0.973271 - - jigsaw identity-attack BiHNet-Multitask 0.085837 - 0.905618 - 11 jigsaw toxicity Adapter-Vanilla 0.177102 0.576288 0.686841 0.938429 11 jigsaw toxicity BiHNet+Vanilla 0.222537 0.580645 0.696388 0.935391 11 jigsaw toxicity BiHNet+Reg 0.552076 0.543160 0.918422 0.930403 11 jigsaw toxicity BiHNet+EWC 0.173575 0.577108 0.622396 0.937142 - jigsaw toxicity Adapter-Multitask 0.573469 - 0.935680 - - jigsaw toxicity BiHNet-Multitask 0.552855 - 0.922125 - 12 PA tpa Adapter-Vanilla 0.071197 0.365297 0.713532 0.949232 12 PA tpa BiHNet+Vanilla 0.065125 0.357942 0.806359 0.912470 12 PA tpa BiHNet+Reg 0.072626 0.366197 0.841588 0.934629 12 PA tpa BiHNet+EWC 0.074959 0.364000 0.756201 0.930620 - PA tpa Adapter-Multitask 0.364035 - 0.947569 - - PA tpa BiHNet-Multitask 0.105491 - 0.902844 - 13 cad affiliation Adapter-Vanilla 0.148270 0.494845 0.618436 0.887943 13 cad affiliation BiHNet+Vanilla 0.151282 0.470825 0.664817 0.888610 13 cad affiliation BiHNet+Reg 0.129193 0.419682 0.643099 0.879390 13 cad affiliation BiHNet+EWC 0.104972 0.502530 0.550398 0.908008 - cad affiliation Adapter-Multitask 0.449064 - 0.878271 - - cad affiliation BiHNet-Multitask 0.317204 - 0.804172 - 14 ucc generalisation Adapter-Vanilla 0.120000 0.235897 0.660351 0.848341 14 ucc generalisation BiHNet+Vanilla 0.122016 0.237288 0.705748 0.859008 14 ucc generalisation BiHNet+Reg 0.096677 0.226164 0.685203 0.875159 14 ucc generalisation BiHNet+EWC 0.107955 0.232258 0.653448 0.874206 - ucc generalisation Adapter-Multitask 0.219178 - 0.834728 - - ucc generalisation BiHNet-Multitask 0.125604 - 0.710813 - Continued on next page 90 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 15 ghc hd Adapter-Vanilla 0.351351 0.425131 0.763803 0.870509 15 ghc hd BiHNet+Vanilla 0.351544 0.443587 0.793900 0.879318 15 ghc hd BiHNet+Reg 0.308617 0.412698 0.780278 0.872039 15 ghc hd BiHNet+EWC 0.291815 0.428850 0.697856 0.878094 - ghc hd Adapter-Multitask 0.391257 - 0.854813 - - ghc hd BiHNet-Multitask 0.363448 - 0.827565 - 16 hate offensive Adapter-Vanilla 0.352511 0.802792 0.685974 0.978245 16 hate offensive BiHNet+Vanilla 0.371750 0.805515 0.720766 0.977552 16 hate offensive BiHNet+Reg 0.781868 0.785835 0.955545 0.979944 16 hate offensive BiHNet+EWC 0.373037 0.809084 0.594099 0.976769 - hate offensive Adapter-Multitask 0.799446 - 0.976373 - - hate offensive BiHNet-Multitask 0.766355 - 0.962001 - 17 abusive hateful Adapter-Vanilla 0.270035 0.458667 0.763553 0.858683 17 abusive hateful BiHNet+Vanilla 0.278997 0.410728 0.770081 0.854976 17 abusive hateful BiHNet+Reg 0.165092 0.424520 0.666253 0.864749 17 abusive hateful BiHNet+EWC 0.275524 0.421230 0.728667 0.849809 - abusive hateful Adapter-Multitask 0.420432 - 0.843342 - - abusive hateful BiHNet-Multitask 0.189639 - 0.774595 - 18 ucc dismissive Adapter-Vanilla 0.047138 0.235589 0.588588 0.825034 18 ucc dismissive BiHNet+Vanilla 0.060748 0.220994 0.591715 0.822811 18 ucc dismissive BiHNet+Reg 0.146835 0.207299 0.681038 0.819899 18 ucc dismissive BiHNet+EWC 0.065327 0.229508 0.576748 0.808745 - ucc dismissive Adapter-Multitask 0.145923 - 0.801140 - - ucc dismissive BiHNet-Multitask 0.162839 - 0.769410 - 19 PA a Adapter-Vanilla 0.430756 0.774558 0.797523 0.962485 19 PA a BiHNet+Vanilla 0.519235 0.760917 0.857912 0.963369 19 PA a BiHNet+Reg 0.733024 0.748555 0.947966 0.961693 Continued on next page 91 Countering Problematic Content in the Digital Space: 
Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 19 PA a BiHNet+EWC 0.455738 0.761735 0.829767 0.962449 - PA a Adapter-Multitask 0.755801 - 0.961488 - - PA a BiHNet-Multitask 0.708326 - 0.950576 - 20 cad personal Adapter-Vanilla 0.116608 0.381703 0.589687 0.878956 20 cad personal BiHNet+Vanilla 0.165088 0.381356 0.637047 0.864690 20 cad personal BiHNet+Reg 0.141732 0.391681 0.609079 0.880668 20 cad personal BiHNet+EWC 0.139053 0.396552 0.569930 0.869262 - cad personal Adapter-Multitask 0.381963 - 0.868548 - - cad personal BiHNet-Multitask 0.264045 - 0.801124 - 21 jigsaw insult Adapter-Vanilla 0.159140 0.548837 0.673561 0.951626 21 jigsaw insult BiHNet+Vanilla 0.168421 0.618182 0.663345 0.950417 21 jigsaw insult BiHNet+Reg 0.561667 0.525070 0.934685 0.949777 21 jigsaw insult BiHNet+EWC 0.134516 0.555082 0.589250 0.947814 - jigsaw insult Adapter-Multitask 0.591755 - 0.948925 - - jigsaw insult BiHNet-Multitask 0.483471 - 0.916784 - 22 ucc sarcastic Adapter-Vanilla 0.051576 0.179817 0.537452 0.715202 22 ucc sarcastic BiHNet+Vanilla 0.058700 0.156682 0.535132 0.707973 22 ucc sarcastic BiHNet+Reg 0.090909 0.158956 0.632267 0.710375 22 ucc sarcastic BiHNet+EWC 0.090703 0.158163 0.615797 0.714295 - ucc sarcastic Adapter-Multitask 0.115385 - 0.675992 - - ucc sarcastic BiHNet-Multitask 0.057582 - 0.590061 - 23 ghc vo Adapter-Vanilla 0.339791 0.474674 0.784665 0.893579 23 ghc vo BiHNet+Vanilla 0.333333 0.494453 0.810356 0.897837 23 ghc vo BiHNet+Reg 0.435155 0.471446 0.891330 0.899036 23 ghc vo BiHNet+EWC 0.324538 0.488114 0.735190 0.890318 - ghc vo Adapter-Multitask 0.492221 - 0.902838 - - ghc vo BiHNet-Multitask 0.430180 - 0.887518 - Continued on next page 92 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page order dataset task model final-f1 instant-f1 final-auc instant-auc 24 abusive abusive Adapter-Vanilla 0.237068 0.909381 0.637408 0.975141 24 abusive abusive BiHNet+Vanilla 0.296675 0.906077 0.784075 0.974635 24 abusive abusive BiHNet+Reg 0.891249 0.897924 0.966972 0.972513 24 abusive abusive BiHNet+EWC 0.296176 0.905965 0.681150 0.975408 - abusive abusive Adapter-Multitask 0.902729 - 0.974823 - - abusive abusive BiHNet-Multitask 0.868651 - 0.940765 - 25 PA ra Adapter-Vanilla 0.439443 0.746765 0.822923 0.972592 25 PA ra BiHNet+Vanilla 0.521540 0.750300 0.881657 0.974125 25 PA ra BiHNet+Reg 0.728748 0.743187 0.966885 0.972671 25 PA ra BiHNet+EWC 0.440975 0.741830 0.851548 0.974420 - PA ra Adapter-Multitask 0.728530 - 0.969852 - - PA ra BiHNet-Multitask 0.668837 - 0.955089 - 26 cad identity Adapter-Vanilla 0.349686 0.352399 0.759334 0.780956 26 cad identity BiHNet+Vanilla 0.396285 0.405063 0.784712 0.800906 26 cad identity BiHNet+Reg 0.390533 0.396292 0.791699 0.799686 26 cad identity BiHNet+EWC 0.369469 0.390764 0.740702 0.802461 - cad identity Adapter-Multitask 0.369803 - 0.781649 - - cad identity BiHNet-Multitask 0.292017 - 0.757460 - Continued on next page Table B.7: Instant and final AUC and F1 scores for upstream tasks for the random order experiment dataset task model few-shot-auc few-shot-f1 BAD2 - Adapter-Vanilla 0.626491 0.475584 BAD2 - BiHNet+Vanilla 0.591835 0.442589 BAD2 - BiHNet+Reg 0.627312 0.469799 BAD2 - BiHNet+EWC 0.624396 0.483940 Continued on next page 93 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model 
few-shot-auc few-shot-f1 BAD2 - Adapter-Multitask 0.643871 0.492441 BAD2 - BiHNet-Multitask 0.661902 0.482916 BAD4 - Adapter-Vanilla 0.590429 0.335484 BAD4 - BiHNet+Vanilla 0.560764 0.404692 BAD4 - BiHNet+Reg 0.591853 0.445521 BAD4 - BiHNet+EWC 0.623405 0.448454 BAD4 - Adapter-Multitask 0.628114 0.482385 BAD4 - BiHNet-Multitask 0.637908 0.474747 cad counterspeech Adapter-Vanilla 0.947467 0.004090 cad counterspeech BiHNet+Vanilla 0.940275 0.004717 cad counterspeech BiHNet+Reg 0.994684 0.003210 cad counterspeech BiHNet+EWC 0.890557 0.004376 cad counterspeech Adapter-Multitask 0.973734 0.003040 cad counterspeech BiHNet-Multitask 0.933083 0.004785 cmsb sexist Adapter-Vanilla 0.800860 0.401189 cmsb sexist BiHNet+Vanilla 0.791143 0.428305 cmsb sexist BiHNet+Reg 0.847109 0.464678 cmsb sexist BiHNet+EWC 0.788794 0.433862 cmsb sexist Adapter-Multitask 0.838390 0.458685 cmsb sexist BiHNet-Multitask 0.858623 0.487342 conan disabled Adapter-Vanilla 0.904717 0.413793 conan disabled BiHNet+Vanilla 0.971757 0.424242 conan disabled BiHNet+Reg 0.970236 0.500000 conan disabled BiHNet+EWC 0.964673 0.451613 conan disabled Adapter-Multitask 0.988589 0.555556 conan disabled BiHNet-Multitask 0.932389 0.344262 conan jews Adapter-Vanilla 0.929167 0.606452 Continued on next page 94 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 conan jews BiHNet+Vanilla 0.916136 0.563830 conan jews BiHNet+Reg 0.986761 0.814286 conan jews BiHNet+EWC 0.955000 0.658683 conan jews Adapter-Multitask 0.971250 0.769231 conan jews BiHNet-Multitask 0.911648 0.625000 conan lgbt Adapter-Vanilla 0.826356 0.436975 conan lgbt BiHNet+Vanilla 0.841163 0.455446 conan lgbt BiHNet+Reg 0.890511 0.426230 conan lgbt BiHNet+EWC 0.726521 0.318519 conan lgbt Adapter-Multitask 0.876452 0.448430 conan lgbt BiHNet-Multitask 0.864446 0.454148 conan migrant Adapter-Vanilla 0.937178 0.787879 conan migrant BiHNet+Vanilla 0.933143 0.764706 conan migrant BiHNet+Reg 0.948523 0.783019 conan migrant BiHNet+EWC 0.889955 0.616601 conan migrant Adapter-Multitask 0.961840 0.833333 conan migrant BiHNet-Multitask 0.925652 0.697248 conan muslims Adapter-Vanilla 0.973152 0.869863 conan muslims BiHNet+Vanilla 0.961423 0.807818 conan muslims BiHNet+Reg 0.966340 0.835017 conan muslims BiHNet+EWC 0.946108 0.762500 conan muslims Adapter-Multitask 0.987032 0.880795 conan muslims BiHNet-Multitask 0.953043 0.845361 conan poc Adapter-Vanilla 0.705530 0.242105 conan poc BiHNet+Vanilla 0.930292 0.492063 conan poc BiHNet+Reg 0.848664 0.309524 conan poc BiHNet+EWC 0.856897 0.400000 Continued on next page 95 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 conan poc Adapter-Multitask 0.907496 0.394737 conan poc BiHNet-Multitask 0.757419 0.259740 conan woman Adapter-Vanilla 0.945992 0.659091 conan woman BiHNet+Vanilla 0.927384 0.629213 conan woman BiHNet+Reg 0.921676 0.744828 conan woman BiHNet+EWC 0.938102 0.608696 conan woman Adapter-Multitask 0.982824 0.745562 conan woman BiHNet-Multitask 0.898216 0.612022 dygen african Adapter-Vanilla 0.697561 0.031546 dygen african BiHNet+Vanilla 0.889696 0.043103 dygen african BiHNet+Reg 0.822976 0.032895 dygen african BiHNet+EWC 0.789526 0.028846 dygen african Adapter-Multitask 0.791274 0.031496 dygen african BiHNet-Multitask 0.894539 0.030848 dygen animosity Adapter-Vanilla 0.545165 0.164412 dygen 
animosity BiHNet+Vanilla 0.553239 0.164929 dygen animosity BiHNet+Reg 0.556119 0.166000 dygen animosity BiHNet+EWC 0.541385 0.156479 dygen animosity Adapter-Multitask 0.528676 0.157377 dygen animosity BiHNet-Multitask 0.577321 0.181818 dygen arab Adapter-Vanilla 0.706551 0.048900 dygen arab BiHNet+Vanilla 0.684826 0.043584 dygen arab BiHNet+Reg 0.771614 0.061776 dygen arab BiHNet+EWC 0.673449 0.043222 dygen arab Adapter-Multitask 0.720759 0.061135 dygen arab BiHNet-Multitask 0.769525 0.055470 dygen asi Adapter-Vanilla 0.722597 0.021341 Continued on next page 96 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 dygen asi BiHNet+Vanilla 0.602426 0.016985 dygen asi BiHNet+Reg 0.680983 0.016416 dygen asi BiHNet+EWC 0.639644 0.018154 dygen asi Adapter-Multitask 0.637484 0.013106 dygen asi BiHNet-Multitask 0.672150 0.018490 dygen asi.chin Adapter-Vanilla 0.684886 0.040449 dygen asi.chin BiHNet+Vanilla 0.822891 0.050505 dygen asi.chin BiHNet+Reg 0.900363 0.057221 dygen asi.chin BiHNet+EWC 0.740221 0.048408 dygen asi.chin Adapter-Multitask 0.750432 0.040080 dygen asi.chin BiHNet-Multitask 0.813962 0.046875 dygen asi.east Adapter-Vanilla 0.599577 0.017668 dygen asi.east BiHNet+Vanilla 0.719864 0.032698 dygen asi.east BiHNet+Reg 0.792294 0.062257 dygen asi.east BiHNet+EWC 0.738057 0.031034 dygen asi.east Adapter-Multitask 0.566423 0.021692 dygen asi.east BiHNet-Multitask 0.673008 0.022508 dygen asi.south Adapter-Vanilla 0.694890 0.060086 dygen asi.south BiHNet+Vanilla 0.670054 0.050000 dygen asi.south BiHNet+Reg 0.820420 0.086275 dygen asi.south BiHNet+EWC 0.669341 0.057803 dygen asi.south Adapter-Multitask 0.804298 0.065906 dygen asi.south BiHNet-Multitask 0.702177 0.055749 dygen asylum Adapter-Vanilla 0.741776 0.010909 dygen asylum BiHNet+Vanilla 0.818531 0.013187 dygen asylum BiHNet+Reg 0.913690 0.026549 dygen asylum BiHNet+EWC 0.704966 0.013015 Continued on next page 97 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 dygen asylum Adapter-Multitask 0.841792 0.011976 dygen asylum BiHNet-Multitask 0.959743 0.027211 dygen bla Adapter-Vanilla 0.663344 0.218642 dygen bla BiHNet+Vanilla 0.676250 0.214612 dygen bla BiHNet+Reg 0.783386 0.273713 dygen bla BiHNet+EWC 0.662496 0.197213 dygen bla Adapter-Multitask 0.743135 0.222460 dygen bla BiHNet-Multitask 0.769149 0.235669 dygen bla.man Adapter-Vanilla 0.843789 0.021505 dygen bla.man BiHNet+Vanilla 0.853931 0.032680 dygen bla.man BiHNet+Reg 0.913739 0.022346 dygen bla.man BiHNet+EWC 0.826485 0.018116 dygen bla.man Adapter-Multitask 0.914314 0.020374 dygen bla.man BiHNet-Multitask 0.817650 0.019305 dygen bla.wom Adapter-Vanilla 0.886206 0.046218 dygen bla.wom BiHNet+Vanilla 0.713370 0.025974 dygen bla.wom BiHNet+Reg 0.865667 0.033537 dygen bla.wom BiHNet+EWC 0.740031 0.033028 dygen bla.wom Adapter-Multitask 0.869987 0.034321 dygen bla.wom BiHNet-Multitask 0.796928 0.024691 dygen dehumanization Adapter-Vanilla 0.763208 0.142857 dygen dehumanization BiHNet+Vanilla 0.746079 0.151111 dygen dehumanization BiHNet+Reg 0.790485 0.160643 dygen dehumanization BiHNet+EWC 0.739724 0.129693 dygen dehumanization Adapter-Multitask 0.723382 0.117130 dygen dehumanization BiHNet-Multitask 0.727210 0.130159 dygen derogation Adapter-Vanilla 0.589725 0.455206 Continued on next page 98 Countering Problematic Content in the Digital 
Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 dygen derogation BiHNet+Vanilla 0.576981 0.459941 dygen derogation BiHNet+Reg 0.651349 0.545455 dygen derogation BiHNet+EWC 0.591059 0.495477 dygen derogation Adapter-Multitask 0.596901 0.507422 dygen derogation BiHNet-Multitask 0.692075 0.578187 dygen dis Adapter-Vanilla 0.664966 0.094241 dygen dis BiHNet+Vanilla 0.653491 0.087855 dygen dis BiHNet+Reg 0.794327 0.111288 dygen dis BiHNet+EWC 0.626324 0.085202 dygen dis Adapter-Multitask 0.684887 0.091082 dygen dis BiHNet-Multitask 0.726102 0.124748 dygen for Adapter-Vanilla 0.833637 0.047970 dygen for BiHNet+Vanilla 0.725930 0.039927 dygen for BiHNet+Reg 0.929193 0.107023 dygen for BiHNet+EWC 0.769685 0.036474 dygen for Adapter-Multitask 0.832336 0.055202 dygen for BiHNet-Multitask 0.903980 0.076372 dygen gay Adapter-Vanilla 0.813890 0.081784 dygen gay BiHNet+Vanilla 0.721734 0.075269 dygen gay BiHNet+Reg 0.805713 0.076312 dygen gay BiHNet+EWC 0.734685 0.079681 dygen gay Adapter-Multitask 0.875041 0.097087 dygen gay BiHNet-Multitask 0.826741 0.081169 dygen gay.man Adapter-Vanilla 0.719518 0.056338 dygen gay.man BiHNet+Vanilla 0.671613 0.050633 dygen gay.man BiHNet+Reg 0.677750 0.039052 dygen gay.man BiHNet+EWC 0.669622 0.044304 Continued on next page 99 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 dygen gay.man Adapter-Multitask 0.751199 0.047478 dygen gay.man BiHNet-Multitask 0.669411 0.039216 dygen gay.wom Adapter-Vanilla 0.653895 0.048780 dygen gay.wom BiHNet+Vanilla 0.578229 0.037037 dygen gay.wom BiHNet+Reg 0.682982 0.060302 dygen gay.wom BiHNet+EWC 0.640716 0.039634 dygen gay.wom Adapter-Multitask 0.696146 0.045296 dygen gay.wom BiHNet-Multitask 0.763081 0.058027 dygen gendermin Adapter-Vanilla 0.688054 0.024578 dygen gendermin BiHNet+Vanilla 0.711625 0.021362 dygen gendermin BiHNet+Reg 0.842811 0.029173 dygen gendermin BiHNet+EWC 0.639510 0.021116 dygen gendermin Adapter-Multitask 0.880199 0.035587 dygen gendermin BiHNet-Multitask 0.790749 0.029173 dygen immig Adapter-Vanilla 0.743909 0.083019 dygen immig BiHNet+Vanilla 0.781631 0.144828 dygen immig BiHNet+Reg 0.821696 0.170492 dygen immig BiHNet+EWC 0.708115 0.078704 dygen immig Adapter-Multitask 0.840829 0.120000 dygen immig BiHNet-Multitask 0.771645 0.093700 dygen indig Adapter-Vanilla 0.817480 0.033195 dygen indig BiHNet+Vanilla 0.718626 0.024263 dygen indig BiHNet+Reg 0.800475 0.040201 dygen indig BiHNet+EWC 0.847406 0.038278 dygen indig Adapter-Multitask 0.917906 0.043689 dygen indig BiHNet-Multitask 0.766115 0.022191 dygen jew Adapter-Vanilla 0.786166 0.118902 Continued on next page 100 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 dygen jew BiHNet+Vanilla 0.781324 0.146597 dygen jew BiHNet+Reg 0.846148 0.200000 dygen jew BiHNet+EWC 0.839360 0.169133 dygen jew Adapter-Multitask 0.784537 0.129713 dygen jew BiHNet-Multitask 0.774725 0.106667 dygen mixed.race Adapter-Vanilla 0.531906 0.019569 dygen mixed.race BiHNet+Vanilla 0.646306 0.022857 dygen mixed.race BiHNet+Reg 0.555626 0.017429 dygen mixed.race BiHNet+EWC 0.611304 0.029412 dygen mixed.race Adapter-Multitask 0.558827 0.016863 dygen mixed.race BiHNet-Multitask 0.638592 0.023468 dygen mus Adapter-Vanilla 0.755388 0.135472 dygen mus BiHNet+Vanilla 
0.797697 0.148014 dygen mus BiHNet+Reg 0.765743 0.122754 dygen mus BiHNet+EWC 0.772548 0.143113 dygen mus Adapter-Multitask 0.816584 0.150289 dygen mus BiHNet-Multitask 0.698485 0.104031 dygen mus.wom Adapter-Vanilla 0.645392 0.016438 dygen mus.wom BiHNet+Vanilla 0.717868 0.010417 dygen mus.wom BiHNet+Reg 0.833229 0.014545 dygen mus.wom BiHNet+EWC 0.736740 0.018059 dygen mus.wom Adapter-Multitask 0.766520 0.016807 dygen mus.wom BiHNet-Multitask 0.758558 0.012945 dygen non.white Adapter-Vanilla 0.824000 0.061093 dygen non.white BiHNet+Vanilla 0.696062 0.056604 dygen non.white BiHNet+Reg 0.824159 0.070866 dygen non.white BiHNet+EWC 0.801129 0.068100 Continued on next page 101 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 dygen non.white Adapter-Multitask 0.838850 0.058925 dygen non.white BiHNet-Multitask 0.839195 0.076577 dygen ref Adapter-Vanilla 0.834419 0.098039 dygen ref BiHNet+Vanilla 0.868346 0.123348 dygen ref BiHNet+Reg 0.788232 0.068966 dygen ref BiHNet+EWC 0.814017 0.076923 dygen ref Adapter-Multitask 0.908773 0.126482 dygen ref BiHNet-Multitask 0.856012 0.095745 dygen support Adapter-Vanilla 0.606195 0.013962 dygen support BiHNet+Vanilla 0.794912 0.060606 dygen support BiHNet+Reg 0.451712 0.007207 dygen support BiHNet+EWC 0.682239 0.016563 dygen support Adapter-Multitask 0.696765 0.017021 dygen support BiHNet-Multitask 0.740696 0.021645 dygen threatening Adapter-Vanilla 0.852452 0.139013 dygen threatening BiHNet+Vanilla 0.793205 0.112735 dygen threatening BiHNet+Reg 0.798413 0.113725 dygen threatening BiHNet+EWC 0.810625 0.136709 dygen threatening Adapter-Multitask 0.882179 0.145631 dygen threatening BiHNet-Multitask 0.866154 0.121008 dygen trans Adapter-Vanilla 0.558231 0.096525 dygen trans BiHNet+Vanilla 0.619845 0.106538 dygen trans BiHNet+Reg 0.817006 0.146132 dygen trans BiHNet+EWC 0.615229 0.093352 dygen trans Adapter-Multitask 0.735171 0.135189 dygen trans BiHNet-Multitask 0.714170 0.124077 dygen trav Adapter-Vanilla 0.646662 0.020243 Continued on next page 102 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 dygen trav BiHNet+Vanilla 0.564392 0.021053 dygen trav BiHNet+Reg 0.762115 0.029350 dygen trav BiHNet+EWC 0.611448 0.023576 dygen trav Adapter-Multitask 0.664540 0.028169 dygen trav BiHNet-Multitask 0.606042 0.022814 dygen wom Adapter-Vanilla 0.666830 0.191529 dygen wom BiHNet+Vanilla 0.772368 0.252459 dygen wom BiHNet+Reg 0.849288 0.369515 dygen wom BiHNet+EWC 0.702072 0.194139 dygen wom Adapter-Multitask 0.769987 0.248322 dygen wom BiHNet-Multitask 0.757370 0.227474 ghc cv Adapter-Vanilla 0.812127 0.062893 ghc cv BiHNet+Vanilla 0.781179 0.062500 ghc cv BiHNet+Reg 0.838447 0.060403 ghc cv BiHNet+EWC 0.824924 0.062176 ghc cv Adapter-Multitask 0.825069 0.072000 ghc cv BiHNet-Multitask 0.818089 0.045977 hatecheck black Adapter-Vanilla 0.789423 0.425000 hatecheck black BiHNet+Vanilla 0.843558 0.496552 hatecheck black BiHNet+Reg 0.931186 0.641791 hatecheck black BiHNet+EWC 0.876891 0.448087 hatecheck black Adapter-Multitask 0.926859 0.552632 hatecheck black BiHNet-Multitask 0.856827 0.426230 hatecheck disabled Adapter-Vanilla 0.886520 0.507463 hatecheck disabled BiHNet+Vanilla 0.880580 0.624204 hatecheck disabled BiHNet+Reg 0.954725 0.870968 hatecheck disabled BiHNet+EWC 0.906063 0.584795 Continued on next page 103 Countering 
Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 hatecheck disabled Adapter-Multitask 0.965245 0.622222 hatecheck disabled BiHNet-Multitask 0.894543 0.538462 hatecheck gay Adapter-Vanilla 0.906400 0.512195 hatecheck gay BiHNet+Vanilla 0.932067 0.615385 hatecheck gay BiHNet+Reg 0.902274 0.517647 hatecheck gay BiHNet+EWC 0.890527 0.580645 hatecheck gay Adapter-Multitask 0.959058 0.646617 hatecheck gay BiHNet-Multitask 0.797588 0.413793 hatecheck hate Adapter-Vanilla 0.779787 0.742597 hatecheck hate BiHNet+Vanilla 0.711358 0.669704 hatecheck hate BiHNet+Reg 0.745539 0.738854 hatecheck hate BiHNet+EWC 0.768348 0.750000 hatecheck hate Adapter-Multitask 0.822555 0.786957 hatecheck hate BiHNet-Multitask 0.798437 0.806517 hatecheck immigrants Adapter-Vanilla 0.862502 0.502857 hatecheck immigrants BiHNet+Vanilla 0.919529 0.592593 hatecheck immigrants BiHNet+Reg 0.915845 0.704000 hatecheck immigrants BiHNet+EWC 0.842041 0.443114 hatecheck immigrants Adapter-Multitask 0.930885 0.502732 hatecheck immigrants BiHNet-Multitask 0.936488 0.615385 hatecheck muslims Adapter-Vanilla 0.909837 0.617647 hatecheck muslims BiHNet+Vanilla 0.929787 0.633094 hatecheck muslims BiHNet+Reg 0.940720 0.588235 hatecheck muslims BiHNet+EWC 0.923197 0.616438 hatecheck muslims Adapter-Multitask 0.937066 0.544218 hatecheck muslims BiHNet-Multitask 0.887850 0.545455 hatecheck trans Adapter-Vanilla 0.751396 0.291339 Continued on next page 104 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 hatecheck trans BiHNet+Vanilla 0.891404 0.561644 hatecheck trans BiHNet+Reg 0.940533 0.678899 hatecheck trans BiHNet+EWC 0.825156 0.395939 hatecheck trans Adapter-Multitask 0.876546 0.361991 hatecheck trans BiHNet-Multitask 0.851881 0.454545 hatecheck women Adapter-Vanilla 0.861084 0.485981 hatecheck women BiHNet+Vanilla 0.941924 0.681319 hatecheck women BiHNet+Reg 0.954110 0.747253 hatecheck women BiHNet+EWC 0.948801 0.609524 hatecheck women Adapter-Multitask 0.952622 0.646465 hatecheck women BiHNet-Multitask 0.860923 0.374269 misogyny - Adapter-Vanilla 0.803650 0.362264 misogyny - BiHNet+Vanilla 0.814446 0.380567 misogyny - BiHNet+Reg 0.853848 0.332248 misogyny - BiHNet+EWC 0.817719 0.335766 misogyny - Adapter-Multitask 0.858276 0.385185 misogyny - BiHNet-Multitask 0.832160 0.341137 multi - Adapter-Vanilla 0.643382 0.237037 multi - BiHNet+Vanilla 0.631730 0.215385 multi - BiHNet+Reg 0.592240 0.182062 multi - BiHNet+EWC 0.575144 0.184080 multi - Adapter-Multitask 0.632464 0.220779 multi - BiHNet-Multitask 0.625541 0.218023 single - Adapter-Vanilla 0.923063 0.618852 single - BiHNet+Vanilla 0.909798 0.554622 single - BiHNet+Reg 0.887218 0.483180 single - BiHNet+EWC 0.904630 0.562162 Continued on next page 105 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 single - Adapter-Multitask 0.958845 0.687747 single - BiHNet-Multitask 0.869882 0.502370 single-adversarial - Adapter-Vanilla 0.836229 0.521739 single-adversarial - BiHNet+Vanilla 0.768038 0.366355 single-adversarial - BiHNet+Reg 0.831907 0.490991 single-adversarial - BiHNet+EWC 0.846279 0.459770 single-adversarial - Adapter-Multitask 0.900268 0.592941 single-adversarial - BiHNet-Multitask 0.797279 0.402367 stormfront - Adapter-Vanilla 
0.861921 0.794595 stormfront - BiHNet+Vanilla 0.862494 0.740113 stormfront - BiHNet+Reg 0.872769 0.779944 stormfront - BiHNet+EWC 0.834097 0.774869 stormfront - Adapter-Multitask 0.861880 0.776596 stormfront - BiHNet-Multitask 0.865701 0.754617 us-election hof Adapter-Vanilla 0.751050 0.293103 us-election hof BiHNet+Vanilla 0.633272 0.225166 us-election hof BiHNet+Reg 0.808955 0.385321 us-election hof BiHNet+EWC 0.739496 0.278788 us-election hof Adapter-Multitask 0.786699 0.333333 us-election hof BiHNet-Multitask 0.792411 0.297030 Table B.8: AUC and F1 scores for few-shot downstream tasks for the chronological experiment dataset task model few-shot-auc few-shot-f1 BAD2 - Single 0.635964 0.490090 BAD2 - Adapter-Single 0.654797 0.483221 BAD2 - BiHNet-Single 0.620018 0.467909 BAD2 - Adapter-Vanilla 0.678801 0.475962 Continued on next page 106 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 BAD2 - BiHNet+Vanilla 0.582984 0.435165 BAD2 - BiHNet+Reg 0.660194 0.491484 BAD2 - BiHNet+EWC 0.633916 0.470588 BAD2 - Adapter-Multitask 0.702097 0.514039 BAD2 - BiHNet-Multitask 0.714881 0.537445 BAD4 - Single 0.689085 0.469841 BAD4 - Adapter-Single 0.670554 0.455056 BAD4 - BiHNet-Single 0.661543 0.470270 BAD4 - Adapter-Vanilla 0.679876 0.468085 BAD4 - BiHNet+Vanilla 0.603978 0.454918 BAD4 - BiHNet+Reg 0.604742 0.447552 BAD4 - BiHNet+EWC 0.613064 0.438889 BAD4 - Adapter-Multitask 0.655514 0.455056 BAD4 - BiHNet-Multitask 0.639380 0.480447 CAD counterspeech Single 0.622264 0.002805 CAD counterspeech Adapter-Single 0.924328 0.004264 CAD counterspeech BiHNet-Single 0.636023 0.002685 CAD counterspeech Adapter-Vanilla 0.956223 0.005682 CAD counterspeech BiHNet+Vanilla 0.988743 0.004640 CAD counterspeech BiHNet+Reg 0.833646 0.003597 CAD counterspeech BiHNet+EWC 0.950907 0.005013 CAD counterspeech Adapter-Multitask 0.988743 0.006369 CAD counterspeech BiHNet-Multitask 0.931207 0.004535 CMSB sexist Single 0.832720 0.494071 CMSB sexist Adapter-Single 0.830289 0.483221 CMSB sexist BiHNet-Single 0.819125 0.464088 CMSB sexist Adapter-Vanilla 0.857568 0.510242 Continued on next page 107 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 CMSB sexist BiHNet+Vanilla 0.849790 0.509294 CMSB sexist BiHNet+Reg 0.855256 0.515625 CMSB sexist BiHNet+EWC 0.883429 0.549165 CMSB sexist Adapter-Multitask 0.878635 0.531835 CMSB sexist BiHNet-Multitask 0.843043 0.483926 CONAN disabled Single 0.995150 0.851064 CONAN disabled Adapter-Single 0.997623 0.933333 CONAN disabled BiHNet-Single 0.995626 0.637681 CONAN disabled Adapter-Vanilla 0.951217 0.478873 CONAN disabled BiHNet+Vanilla 0.918315 0.357895 CONAN disabled BiHNet+Reg 0.989730 0.458333 CONAN disabled BiHNet+EWC 0.940044 0.535211 CONAN disabled Adapter-Multitask 0.993343 0.666667 CONAN disabled BiHNet-Multitask 0.897062 0.295082 CONAN jews Single 0.994053 0.931034 CONAN jews Adapter-Single 0.992500 0.890625 CONAN jews BiHNet-Single 0.977670 0.775194 CONAN jews Adapter-Vanilla 0.973902 0.734694 CONAN jews BiHNet+Vanilla 0.931477 0.522936 CONAN jews BiHNet+Reg 0.953617 0.627907 CONAN jews BiHNet+EWC 0.960663 0.684932 CONAN jews Adapter-Multitask 0.978371 0.839695 CONAN jews BiHNet-Multitask 0.957102 0.548077 CONAN LGBT Single 0.912992 0.543353 CONAN LGBT Adapter-Single 0.935733 0.577540 CONAN LGBT BiHNet-Single 0.895403 0.539326 CONAN LGBT 
Adapter-Vanilla 0.925165 0.538071 Continued on next page 108 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 CONAN LGBT BiHNet+Vanilla 0.937694 0.533937 CONAN LGBT BiHNet+Reg 0.889820 0.453125 CONAN LGBT BiHNet+EWC 0.925165 0.537313 CONAN LGBT Adapter-Multitask 0.937451 0.529412 CONAN LGBT BiHNet-Multitask 0.854065 0.494253 CONAN migrant Single 0.977594 0.897297 CONAN migrant Adapter-Single 0.987959 0.913978 CONAN migrant BiHNet-Single 0.983447 0.900000 CONAN migrant Adapter-Vanilla 0.948639 0.789744 CONAN migrant BiHNet+Vanilla 0.914204 0.663755 CONAN migrant BiHNet+Reg 0.901016 0.653386 CONAN migrant BiHNet+EWC 0.906675 0.669456 CONAN migrant Adapter-Multitask 0.972875 0.841584 CONAN migrant BiHNet-Multitask 0.922146 0.664000 CONAN muslims Single 0.991436 0.877076 CONAN muslims Adapter-Single 0.990668 0.907216 CONAN muslims BiHNet-Single 0.992338 0.923077 CONAN muslims Adapter-Vanilla 0.991764 0.929577 CONAN muslims BiHNet+Vanilla 0.987902 0.858065 CONAN muslims BiHNet+Reg 0.957673 0.809211 CONAN muslims BiHNet+EWC 0.972783 0.854237 CONAN muslims Adapter-Multitask 0.993946 0.860841 CONAN muslims BiHNet-Multitask 0.977792 0.787879 CONAN people of color Single 0.885714 0.514851 CONAN people of color Adapter-Single 0.959324 0.782609 CONAN people of color BiHNet-Single 0.981198 0.777778 CONAN people of color Adapter-Vanilla 0.898925 0.692308 Continued on next page 109 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 CONAN people of color BiHNet+Vanilla 0.929555 0.560748 CONAN people of color BiHNet+Reg 0.889831 0.280374 CONAN people of color BiHNet+EWC 0.903195 0.376623 CONAN people of color Adapter-Multitask 0.935730 0.640000 CONAN people of color BiHNet-Multitask 0.916190 0.528302 CONAN woman Single 0.996055 0.870748 CONAN woman Adapter-Single 0.998638 0.891892 CONAN woman BiHNet-Single 0.995671 0.864865 CONAN woman Adapter-Vanilla 0.986123 0.849315 CONAN woman BiHNet+Vanilla 0.928379 0.645161 CONAN woman BiHNet+Reg 0.980048 0.738095 CONAN woman BiHNet+EWC 0.961720 0.648352 CONAN woman Adapter-Multitask 0.994484 0.881119 CONAN woman BiHNet-Multitask 0.971879 0.754717 Dygen African Single 0.709622 0.022642 Dygen African Adapter-Single 0.753744 0.023981 Dygen African BiHNet-Single 0.807282 0.016970 Dygen African Adapter-Vanilla 0.820106 0.036810 Dygen African BiHNet+Vanilla 0.760201 0.021008 Dygen African BiHNet+Reg 0.821272 0.027027 Dygen African BiHNet+EWC 0.782441 0.036630 Dygen African Adapter-Multitask 0.857950 0.040541 Dygen African BiHNet-Multitask 0.860730 0.023256 Dygen animosity Single 0.583085 0.180437 Dygen animosity Adapter-Single 0.561059 0.176707 Dygen animosity BiHNet-Single 0.506374 0.137174 Dygen animosity Adapter-Vanilla 0.564928 0.176871 Continued on next page 110 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen animosity BiHNet+Vanilla 0.575415 0.191136 Dygen animosity BiHNet+Reg 0.534618 0.168067 Dygen animosity BiHNet+EWC 0.577934 0.175299 Dygen animosity Adapter-Multitask 0.552231 0.168276 Dygen animosity BiHNet-Multitask 0.607637 0.193622 Dygen Arabs Single 0.635554 0.031128 Dygen Arabs Adapter-Single 0.675253 0.039457 Dygen Arabs BiHNet-Single 0.748829 0.062640 Dygen Arabs Adapter-Vanilla 0.808592 
0.076503 Dygen Arabs BiHNet+Vanilla 0.735965 0.057851 Dygen Arabs BiHNet+Reg 0.636772 0.048780 Dygen Arabs BiHNet+EWC 0.801646 0.051051 Dygen Arabs Adapter-Multitask 0.834051 0.078329 Dygen Arabs BiHNet-Multitask 0.719747 0.040161 Dygen Asians Single 0.653580 0.034602 Dygen Asians Adapter-Single 0.683574 0.029940 Dygen Asians BiHNet-Single 0.688481 0.023437 Dygen Asians Adapter-Vanilla 0.846577 0.024024 Dygen Asians BiHNet+Vanilla 0.690327 0.016667 Dygen Asians BiHNet+Reg 0.742070 0.018817 Dygen Asians BiHNet+EWC 0.689384 0.016588 Dygen Asians Adapter-Multitask 0.785647 0.016760 Dygen Asians BiHNet-Multitask 0.641292 0.014134 Dygen Chinese people Single 0.783270 0.044543 Dygen Chinese people Adapter-Single 0.815762 0.050481 Dygen Chinese people BiHNet-Single 0.812867 0.039356 Dygen Chinese people Adapter-Vanilla 0.826175 0.044759 Continued on next page 111 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Chinese people BiHNet+Vanilla 0.843012 0.060606 Dygen Chinese people BiHNet+Reg 0.829689 0.057225 Dygen Chinese people BiHNet+EWC 0.816698 0.052369 Dygen Chinese people Adapter-Multitask 0.835825 0.042989 Dygen Chinese people BiHNet-Multitask 0.809353 0.041339 Dygen East Asians Single 0.692402 0.026871 Dygen East Asians Adapter-Single 0.746267 0.024161 Dygen East Asians BiHNet-Single 0.777790 0.061674 Dygen East Asians Adapter-Vanilla 0.709308 0.034884 Dygen East Asians BiHNet+Vanilla 0.760627 0.041667 Dygen East Asians BiHNet+Reg 0.677499 0.039437 Dygen East Asians BiHNet+EWC 0.703587 0.036000 Dygen East Asians Adapter-Multitask 0.824933 0.038647 Dygen East Asians BiHNet-Multitask 0.802792 0.036585 Dygen South Asians Single 0.684706 0.050251 Dygen South Asians Adapter-Single 0.665598 0.051583 Dygen South Asians BiHNet-Single 0.662986 0.079365 Dygen South Asians Adapter-Vanilla 0.780351 0.073702 Dygen South Asians BiHNet+Vanilla 0.733631 0.074675 Dygen South Asians BiHNet+Reg 0.747811 0.060790 Dygen South Asians BiHNet+EWC 0.714140 0.061281 Dygen South Asians Adapter-Multitask 0.738230 0.065574 Dygen South Asians BiHNet-Multitask 0.723940 0.062874 Dygen Asylum seekers Single 0.959743 0.053571 Dygen Asylum seekers Adapter-Single 0.897400 0.021583 Dygen Asylum seekers BiHNet-Single 0.786654 0.016854 Dygen Asylum seekers Adapter-Vanilla 0.767387 0.013072 Continued on next page 112 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Asylum seekers BiHNet+Vanilla 0.930999 0.016227 Dygen Asylum seekers BiHNet+Reg 0.875705 0.013187 Dygen Asylum seekers BiHNet+EWC 0.919956 0.019608 Dygen Asylum seekers Adapter-Multitask 0.843828 0.022901 Dygen Asylum seekers BiHNet-Multitask 0.956532 0.028777 Dygen Black people Single 0.748573 0.219591 Dygen Black people Adapter-Single 0.737509 0.248555 Dygen Black people BiHNet-Single 0.727815 0.234192 Dygen Black people Adapter-Vanilla 0.790263 0.255428 Dygen Black people BiHNet+Vanilla 0.739259 0.243959 Dygen Black people BiHNet+Reg 0.735536 0.238202 Dygen Black people BiHNet+EWC 0.711824 0.242321 Dygen Black people Adapter-Multitask 0.753437 0.237248 Dygen Black people BiHNet-Multitask 0.776706 0.230143 Dygen Black men Single 0.970776 0.023669 Dygen Black men Adapter-Single 0.818695 0.027397 Dygen Black men BiHNet-Single 0.820316 0.023419 Dygen Black men Adapter-Vanilla 0.912066 0.024390 Dygen Black men 
BiHNet+Vanilla 0.908616 0.019640 Dygen Black men BiHNet+Reg 0.908616 0.020374 Dygen Black men BiHNet+EWC 0.986930 0.022989 Dygen Black men Adapter-Multitask 0.950335 0.025157 Dygen Black men BiHNet-Multitask 0.957340 0.024896 Dygen Black women Single 0.796041 0.048193 Dygen Black women Adapter-Single 0.844900 0.044444 Dygen Black women BiHNet-Single 0.836289 0.039911 Dygen Black women Adapter-Vanilla 0.814120 0.036735 Continued on next page 113 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Black women BiHNet+Vanilla 0.828480 0.031936 Dygen Black women BiHNet+Reg 0.815092 0.029605 Dygen Black women BiHNet+EWC 0.796470 0.032454 Dygen Black women Adapter-Multitask 0.825734 0.034156 Dygen Black women BiHNet-Multitask 0.806968 0.037344 Dygen dehumanization Single 0.703067 0.175439 Dygen dehumanization Adapter-Single 0.653162 0.130233 Dygen dehumanization BiHNet-Single 0.729720 0.130719 Dygen dehumanization Adapter-Vanilla 0.803086 0.158654 Dygen dehumanization BiHNet+Vanilla 0.726701 0.129524 Dygen dehumanization BiHNet+Reg 0.730518 0.107981 Dygen dehumanization BiHNet+EWC 0.775381 0.165450 Dygen dehumanization Adapter-Multitask 0.839332 0.142649 Dygen dehumanization BiHNet-Multitask 0.778659 0.107505 Dygen derogation Single 0.514538 0.438830 Dygen derogation Adapter-Single 0.511880 0.483633 Dygen derogation BiHNet-Single 0.523676 0.464508 Dygen derogation Adapter-Vanilla 0.705633 0.566964 Dygen derogation BiHNet+Vanilla 0.702747 0.573463 Dygen derogation BiHNet+Reg 0.632040 0.539097 Dygen derogation BiHNet+EWC 0.706101 0.565619 Dygen derogation Adapter-Multitask 0.702820 0.587181 Dygen derogation BiHNet-Multitask 0.698568 0.566215 Dygen People with disabilities Single 0.656806 0.092555 Dygen People with disabilities Adapter-Single 0.683058 0.088962 Dygen People with disabilities BiHNet-Single 0.672755 0.085106 Dygen People with disabilities Adapter-Vanilla 0.764702 0.123404 Continued on next page 114 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen People with disabilities BiHNet+Vanilla 0.817699 0.201835 Dygen People with disabilities BiHNet+Reg 0.760772 0.130536 Dygen People with disabilities BiHNet+EWC 0.817542 0.156334 Dygen People with disabilities Adapter-Multitask 0.765716 0.145631 Dygen People with disabilities BiHNet-Multitask 0.719064 0.104167 Dygen Foreigners Single 0.865222 0.064368 Dygen Foreigners Adapter-Single 0.884991 0.057034 Dygen Foreigners BiHNet-Single 0.820148 0.054250 Dygen Foreigners Adapter-Vanilla 0.916614 0.078313 Dygen Foreigners BiHNet+Vanilla 0.910111 0.135266 Dygen Foreigners BiHNet+Reg 0.785367 0.041420 Dygen Foreigners BiHNet+EWC 0.908439 0.079027 Dygen Foreigners Adapter-Multitask 0.907064 0.064240 Dygen Foreigners BiHNet-Multitask 0.893594 0.065934 Dygen gay Single 0.875634 0.130031 Dygen gay Adapter-Single 0.833293 0.108911 Dygen gay BiHNet-Single 0.795869 0.080495 Dygen gay Adapter-Vanilla 0.856252 0.110843 Dygen gay BiHNet+Vanilla 0.919566 0.111801 Dygen gay BiHNet+Reg 0.876808 0.101053 Dygen gay BiHNet+EWC 0.889835 0.104208 Dygen gay Adapter-Multitask 0.892645 0.110132 Dygen gay BiHNet-Multitask 0.818323 0.065341 Dygen Gay men Single 0.654332 0.042169 Dygen Gay men Adapter-Single 0.645526 0.038633 Dygen Gay men BiHNet-Single 0.614690 0.031835 Dygen Gay men Adapter-Vanilla 0.756145 0.052142 
Continued on next page 115 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Gay men BiHNet+Vanilla 0.759221 0.043302 Dygen Gay men BiHNet+Reg 0.737160 0.048696 Dygen Gay men BiHNet+EWC 0.748153 0.049689 Dygen Gay men Adapter-Multitask 0.796858 0.055738 Dygen Gay men BiHNet-Multitask 0.700956 0.042003 Dygen Gay women Single 0.575847 0.035961 Dygen Gay women Adapter-Single 0.558069 0.028694 Dygen Gay women BiHNet-Single 0.553636 0.032258 Dygen Gay women Adapter-Vanilla 0.768479 0.061176 Dygen Gay women BiHNet+Vanilla 0.740930 0.059754 Dygen Gay women BiHNet+Reg 0.744051 0.082474 Dygen Gay women BiHNet+EWC 0.635514 0.056206 Dygen Gay women Adapter-Multitask 0.799903 0.061758 Dygen Gay women BiHNet-Multitask 0.731807 0.038647 Dygen Gender minorities Single 0.852108 0.030905 Dygen Gender minorities Adapter-Single 0.795011 0.035794 Dygen Gender minorities BiHNet-Single 0.778906 0.027231 Dygen Gender minorities Adapter-Vanilla 0.868471 0.035461 Dygen Gender minorities BiHNet+Vanilla 0.730162 0.022670 Dygen Gender minorities BiHNet+Reg 0.780365 0.021053 Dygen Gender minorities BiHNet+EWC 0.871331 0.031696 Dygen Gender minorities Adapter-Multitask 0.868585 0.031949 Dygen Gender minorities BiHNet-Multitask 0.761714 0.025641 Dygen Immigrants Single 0.906365 0.182456 Dygen Immigrants Adapter-Single 0.845723 0.180602 Dygen Immigrants BiHNet-Single 0.780274 0.090909 Dygen Immigrants Adapter-Vanilla 0.811105 0.095552 Continued on next page 116 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Immigrants BiHNet+Vanilla 0.809194 0.103448 Dygen Immigrants BiHNet+Reg 0.808537 0.129032 Dygen Immigrants BiHNet+EWC 0.785020 0.089783 Dygen Immigrants Adapter-Multitask 0.816552 0.092399 Dygen Immigrants BiHNet-Multitask 0.815099 0.088685 Dygen indig Single 0.743278 0.040000 Dygen indig Adapter-Single 0.864376 0.050955 Dygen indig BiHNet-Single 0.879705 0.029126 Dygen indig Adapter-Vanilla 0.825127 0.026616 Dygen indig BiHNet+Vanilla 0.849291 0.029316 Dygen indig BiHNet+Reg 0.845764 0.029474 Dygen indig BiHNet+EWC 0.774495 0.026316 Dygen indig Adapter-Multitask 0.800475 0.027273 Dygen indig BiHNet-Multitask 0.869300 0.035635 Dygen Jewish people Single 0.695314 0.117871 Dygen Jewish people Adapter-Single 0.660048 0.091097 Dygen Jewish people BiHNet-Single 0.692381 0.126531 Dygen Jewish people Adapter-Vanilla 0.859924 0.156352 Dygen Jewish people BiHNet+Vanilla 0.770853 0.158664 Dygen Jewish people BiHNet+Reg 0.782482 0.129760 Dygen Jewish people BiHNet+EWC 0.788038 0.141491 Dygen Jewish people Adapter-Multitask 0.819858 0.139384 Dygen Jewish people BiHNet-Multitask 0.833787 0.126856 Dygen Mixed race Single 0.568220 0.017316 Dygen Mixed race Adapter-Single 0.592517 0.017544 Dygen Mixed race BiHNet-Single 0.497586 0.014388 Dygen Mixed race Adapter-Vanilla 0.699045 0.034146 Continued on next page 117 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Mixed race BiHNet+Vanilla 0.586744 0.019444 Dygen Mixed race BiHNet+Reg 0.682698 0.028571 Dygen Mixed race BiHNet+EWC 0.636702 0.019116 Dygen Mixed race Adapter-Multitask 0.694742 0.028807 Dygen Mixed race BiHNet-Multitask 0.671599 0.026906 Dygen Muslims Single 0.789611 0.106996 Dygen Muslims 
Adapter-Single 0.790257 0.120055 Dygen Muslims BiHNet-Single 0.739825 0.125000 Dygen Muslims Adapter-Vanilla 0.846611 0.152727 Dygen Muslims BiHNet+Vanilla 0.806735 0.122503 Dygen Muslims BiHNet+Reg 0.834092 0.191919 Dygen Muslims BiHNet+EWC 0.774975 0.142574 Dygen Muslims Adapter-Multitask 0.879749 0.168297 Dygen Muslims BiHNet-Multitask 0.817724 0.119948 Dygen Muslim women Single 0.714734 0.018265 Dygen Muslim women Adapter-Single 0.722132 0.021277 Dygen Muslim women BiHNet-Single 0.877367 0.023256 Dygen Muslim women Adapter-Vanilla 0.686270 0.017143 Dygen Muslim women BiHNet+Vanilla 0.619937 0.009756 Dygen Muslim women BiHNet+Reg 0.815172 0.015083 Dygen Muslim women BiHNet+EWC 0.939060 0.020367 Dygen Muslim women Adapter-Multitask 0.908840 0.031250 Dygen Muslim women BiHNet-Multitask 0.753292 0.010152 Dygen Non-whites Single 0.862599 0.095541 Dygen Non-whites Adapter-Single 0.851783 0.078534 Dygen Non-whites BiHNet-Single 0.824677 0.070796 Dygen Non-whites Adapter-Vanilla 0.827832 0.070640 Continued on next page 118 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Non-whites BiHNet+Vanilla 0.862513 0.070764 Dygen Non-whites BiHNet+Reg 0.880372 0.077079 Dygen Non-whites BiHNet+EWC 0.805565 0.069930 Dygen Non-whites Adapter-Multitask 0.888207 0.093923 Dygen Non-whites BiHNet-Multitask 0.853555 0.071571 Dygen Refguees Single 0.942489 0.223529 Dygen Refguees Adapter-Single 0.909316 0.142857 Dygen Refguees BiHNet-Single 0.827890 0.063670 Dygen Refguees Adapter-Vanilla 0.887150 0.125461 Dygen Refguees BiHNet+Vanilla 0.888220 0.091603 Dygen Refguees BiHNet+Reg 0.802226 0.082418 Dygen Refguees BiHNet+EWC 0.845984 0.080201 Dygen Refguees Adapter-Multitask 0.898429 0.143426 Dygen Refguees BiHNet-Multitask 0.867457 0.107595 Dygen support Single 0.730528 0.023256 Dygen support Adapter-Single 0.663866 0.021277 Dygen support BiHNet-Single 0.615421 0.012780 Dygen support Adapter-Vanilla 0.549388 0.009479 Dygen support BiHNet+Vanilla 0.568507 0.012005 Dygen support BiHNet+Reg 0.537178 0.010194 Dygen support BiHNet+EWC 0.528541 0.011655 Dygen support Adapter-Multitask 0.636856 0.017167 Dygen support BiHNet-Multitask 0.669362 0.024768 Dygen threatening Single 0.875585 0.177650 Dygen threatening Adapter-Single 0.836170 0.138889 Dygen threatening BiHNet-Single 0.790577 0.108659 Dygen threatening Adapter-Vanilla 0.901731 0.130360 Continued on next page 119 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen threatening BiHNet+Vanilla 0.835296 0.099010 Dygen threatening BiHNet+Reg 0.712324 0.081425 Dygen threatening BiHNet+EWC 0.864872 0.123077 Dygen threatening Adapter-Multitask 0.893550 0.140152 Dygen threatening BiHNet-Multitask 0.860865 0.109546 Dygen Trans people Single 0.694125 0.134293 Dygen Trans people Adapter-Single 0.729872 0.150538 Dygen Trans people BiHNet-Single 0.687860 0.119816 Dygen Trans people Adapter-Vanilla 0.748769 0.160584 Dygen Trans people BiHNet+Vanilla 0.765517 0.127080 Dygen Trans people BiHNet+Reg 0.764915 0.123810 Dygen Trans people BiHNet+EWC 0.790838 0.161100 Dygen Trans people Adapter-Multitask 0.803334 0.166329 Dygen Trans people BiHNet-Multitask 0.747644 0.122754 Dygen Travellers Single 0.669575 0.021668 Dygen Travellers Adapter-Single 0.706848 0.023585 Dygen Travellers BiHNet-Single 0.766577 0.032941 Dygen Travellers 
Adapter-Vanilla 0.670805 0.028169 Dygen Travellers BiHNet+Vanilla 0.697895 0.026465 Dygen Travellers BiHNet+Reg 0.653241 0.020654 Dygen Travellers BiHNet+EWC 0.734996 0.027184 Dygen Travellers Adapter-Multitask 0.649694 0.026144 Dygen Travellers BiHNet-Multitask 0.741318 0.022508 Dygen Women Single 0.756641 0.218409 Dygen Women Adapter-Single 0.852057 0.308998 Dygen Women BiHNet-Single 0.825839 0.273973 Dygen Women Adapter-Vanilla 0.841440 0.317797 Continued on next page 120 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 Dygen Women BiHNet+Vanilla 0.834226 0.322457 Dygen Women BiHNet+Reg 0.828297 0.278317 Dygen Women BiHNet+EWC 0.818255 0.274834 Dygen Women Adapter-Multitask 0.858158 0.344423 Dygen Women BiHNet-Multitask 0.791889 0.276094 GHC class for violence Single 0.641220 0.035088 GHC class for violence Adapter-Single 0.631671 0.026230 GHC class for violence BiHNet-Single 0.627405 0.026906 GHC class for violence Adapter-Vanilla 0.795453 0.042781 GHC class for violence BiHNet+Vanilla 0.728225 0.034115 GHC class for violence BiHNet+Reg 0.789855 0.047244 GHC class for violence BiHNet+EWC 0.757210 0.042827 GHC class for violence Adapter-Multitask 0.822064 0.052786 GHC class for violence BiHNet-Multitask 0.847850 0.055980 hatecheck black Single 0.967115 0.946237 hatecheck black Adapter-Single 0.956154 0.868687 hatecheck black BiHNet-Single 0.934679 0.831683 hatecheck black Adapter-Vanilla 0.944744 0.582781 hatecheck black BiHNet+Vanilla 0.956763 0.756303 hatecheck black BiHNet+Reg 0.966314 0.671642 hatecheck black BiHNet+EWC 0.930929 0.480874 hatecheck black Adapter-Multitask 0.928526 0.604317 hatecheck black BiHNet-Multitask 0.956154 0.573171 hatecheck disabled Single 0.990839 0.836066 hatecheck disabled Adapter-Single 0.985898 0.802920 hatecheck disabled BiHNet-Single 0.924412 0.571429 hatecheck disabled Adapter-Vanilla 0.993782 0.735484 Continued on next page 121 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 hatecheck disabled BiHNet+Vanilla 0.983344 0.666667 hatecheck disabled BiHNet+Reg 0.991395 0.741722 hatecheck disabled BiHNet+EWC 0.984177 0.750000 hatecheck disabled Adapter-Multitask 0.997058 0.881890 hatecheck disabled BiHNet-Multitask 0.941039 0.560847 hatecheck gay Single 0.972348 0.777778 hatecheck gay Adapter-Single 0.956538 0.687500 hatecheck gay BiHNet-Single 0.907722 0.537500 hatecheck gay Adapter-Vanilla 0.968758 0.739496 hatecheck gay BiHNet+Vanilla 0.953200 0.560510 hatecheck gay BiHNet+Reg 0.942996 0.578947 hatecheck gay BiHNet+EWC 0.918588 0.552632 hatecheck gay Adapter-Multitask 0.947909 0.701754 hatecheck gay BiHNet-Multitask 0.864985 0.450262 hatecheck hate Single 0.701328 0.430678 hatecheck hate Adapter-Single 0.717094 0.474286 hatecheck hate BiHNet-Single 0.727140 0.569231 hatecheck hate Adapter-Vanilla 0.815678 0.795876 hatecheck hate BiHNet+Vanilla 0.795384 0.836852 hatecheck hate BiHNet+Reg 0.764893 0.836364 hatecheck hate BiHNet+EWC 0.820777 0.839552 hatecheck hate Adapter-Multitask 0.902120 0.834061 hatecheck hate BiHNet-Multitask 0.846102 0.869718 hatecheck immigrants Single 0.979479 0.890909 hatecheck immigrants Adapter-Single 0.971380 0.857143 hatecheck immigrants BiHNet-Single 0.939898 0.708661 hatecheck immigrants Adapter-Vanilla 0.932347 0.637037 Continued on next page 122 Countering Problematic Content in 
the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 hatecheck immigrants BiHNet+Vanilla 0.968518 0.750000 hatecheck immigrants BiHNet+Reg 0.937097 0.634483 hatecheck immigrants BiHNet+EWC 0.924857 0.600000 hatecheck immigrants Adapter-Multitask 0.968822 0.702290 hatecheck immigrants BiHNet-Multitask 0.971897 0.779661 hatecheck muslims Single 0.958333 0.714286 hatecheck muslims Adapter-Single 0.969806 0.643357 hatecheck muslims BiHNet-Single 0.912862 0.558659 hatecheck muslims Adapter-Vanilla 0.961359 0.647482 hatecheck muslims BiHNet+Vanilla 0.935897 0.620690 hatecheck muslims BiHNet+Reg 0.931943 0.656934 hatecheck muslims BiHNet+EWC 0.888779 0.523256 hatecheck muslims Adapter-Multitask 0.973820 0.717557 hatecheck muslims BiHNet-Multitask 0.903157 0.517241 hatecheck Trans people Single 0.937442 0.876404 hatecheck Trans people Adapter-Single 0.923348 0.716981 hatecheck Trans people BiHNet-Single 0.903304 0.645669 hatecheck Trans people Adapter-Vanilla 0.935780 0.491018 hatecheck Trans people BiHNet+Vanilla 0.916933 0.515723 hatecheck Trans people BiHNet+Reg 0.922750 0.557823 hatecheck Trans people BiHNet+EWC 0.917531 0.611940 hatecheck Trans people Adapter-Multitask 0.933852 0.515723 hatecheck Trans people BiHNet-Multitask 0.850020 0.397906 hatecheck women Single 0.946348 0.680851 hatecheck women Adapter-Single 0.963803 0.857143 hatecheck women BiHNet-Single 0.953789 0.891892 hatecheck women Adapter-Vanilla 0.928732 0.780488 Continued on next page 123 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 hatecheck women BiHNet+Vanilla 0.955639 0.550000 hatecheck women BiHNet+Reg 0.958494 0.839506 hatecheck women BiHNet+EWC 0.884371 0.418919 hatecheck women Adapter-Multitask 0.949163 0.750000 hatecheck women BiHNet-Multitask 0.927204 0.409639 misogyny - Single 0.822216 0.329032 misogyny - Adapter-Single 0.837551 0.334426 misogyny - BiHNet-Single 0.805479 0.322785 misogyny - Adapter-Vanilla 0.844372 0.395522 misogyny - BiHNet+Vanilla 0.839064 0.382671 misogyny - BiHNet+Reg 0.828667 0.335616 misogyny - BiHNet+EWC 0.848168 0.372760 misogyny - Adapter-Multitask 0.865112 0.396825 misogyny - BiHNet-Multitask 0.805919 0.327759 multi - Single 0.839205 0.401028 multi - Adapter-Single 0.709392 0.259740 multi - BiHNet-Single 0.642476 0.196636 multi - Adapter-Vanilla 0.617924 0.191589 multi - BiHNet+Vanilla 0.614951 0.215269 multi - BiHNet+Reg 0.616131 0.191702 multi - BiHNet+EWC 0.597265 0.195773 multi - Adapter-Multitask 0.674493 0.248244 multi - BiHNet-Multitask 0.623469 0.216086 single - Single 0.990007 0.852679 single - Adapter-Single 0.988204 0.871287 single - BiHNet-Single 0.965336 0.679856 single - Adapter-Vanilla 0.939223 0.629126 Continued on next page 124 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation Continued from previous page dataset task model few-shot-auc few-shot-f1 single - BiHNet+Vanilla 0.888218 0.508744 single - BiHNet+Reg 0.927218 0.629771 single - BiHNet+EWC 0.932330 0.634051 single - Adapter-Multitask 0.969689 0.716904 single - BiHNet-Multitask 0.928502 0.606171 adversarial - Single 0.979721 0.837321 adversarial - Adapter-Single 0.977043 0.781726 adversarial - BiHNet-Single 0.954980 0.670232 adversarial - Adapter-Vanilla 0.857171 0.490196 adversarial - BiHNet+Vanilla 0.837839 0.439873 adversarial - BiHNet+Reg 0.859952 0.511149 
adversarial - BiHNet+EWC 0.864196 0.520979 adversarial - Adapter-Multitask 0.912971 0.607803 adversarial - BiHNet-Multitask 0.838634 0.444444 stormfront - Single 0.844468 0.805897 stormfront - Adapter-Single 0.811555 0.766595 stormfront - BiHNet-Single 0.757382 0.709832 stormfront - Adapter-Vanilla 0.884122 0.733728 stormfront - BiHNet+Vanilla 0.848016 0.756032 stormfront - BiHNet+Reg 0.861334 0.776903 stormfront - BiHNet+EWC 0.854757 0.792929 stormfront - Adapter-Multitask 0.901288 0.810390 stormfront - BiHNet-Multitask 0.868593 0.757493 US-election hateful Single 0.668330 0.228571 US-election hateful Adapter-Single 0.664259 0.232558 US-election hateful BiHNet-Single 0.616334 0.224852 US-election hateful Adapter-Vanilla 0.761029 0.379747 US-election hateful BiHNet+Vanilla 0.744485 0.296875 US-election hateful BiHNet+Reg 0.751641 0.357895 US-election hateful BiHNet+EWC 0.787684 0.314961 US-election hateful Adapter-Multitask 0.781250 0.408602 US-election hateful BiHNet-Multitask 0.788209 0.350877

Table B.9: AUC and F1 scores for few-shot downstream tasks for the random order experiment.

Appendix C: Supplemental Material for Chapter 4

C.1 Hardware and Implementation Details

All the experiments were conducted on an NVIDIA RTX A6000 GPU with 48GB of memory; the full experiment takes around 9 hours on a single GPU. We used a learning rate of 1e-4 and optimized with AdamW [114] with an L2 regularization coefficient of 0.01. Source code is available at https://github.com/Ali-Omrani/Culture-In-Offensive-Language-Detection.

C.2 Measuring WEIRDness

Using the fixation index (FST), [131] quantified variation in cultural beliefs and behaviors across societies. Initially used in genetics to assess differentiation among sub-populations, FST has been adapted to cultural psychology [29], where it measures the divergence of cultural traits and assigns a numerical value to cultural distances. The study drew heavily on data from the World Values Survey (WVS) [81], a global initiative tracking the evolution of people's values and beliefs. Through the WVS, [131] examined how individuals from different societies respond to a broad set of questions about values and beliefs.

For each language, we assigned the WEIRDness score of the country from which the language's corpus data was sourced. For instance, the corpus for Arabic was derived from tweets originating in Lebanon, so we applied Lebanon's WEIRDness score to this dataset. For the Greek and Portuguese datasets, we adapted this approach because specific WEIRDness scores were unavailable for Greece and Portugal; instead, we used the scores of geographically proximate countries, applying Macedonia's score to the Greek dataset and Spain's score to the Portuguese dataset.
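To make the fixation-index logic concrete, the sketch below computes a simple cultural FST over binary survey items answered by two populations: for each item, F_ST = (H_T - H_S) / H_T, where H_S is the mean within-population response diversity and H_T is the diversity of the pooled responses. This is only a minimal illustration under simplifying assumptions (binary items, two equally sized populations, synthetic data); it is not the exact procedure of [131], whose per-country WEIRDness scores are what we actually use.

```python
import numpy as np

def cultural_fst(responses_a: np.ndarray, responses_b: np.ndarray) -> float:
    """Simple fixation index F_ST for binary survey items from two populations.

    responses_a, responses_b: arrays of shape (num_respondents, num_items)
    with 0/1 answers. For each item, F_ST = (H_T - H_S) / H_T, and the
    item-level values are averaged into a single cultural distance.
    """
    p_a = responses_a.mean(axis=0)            # per-item answer frequency, population A
    p_b = responses_b.mean(axis=0)            # per-item answer frequency, population B
    p_t = (p_a + p_b) / 2                     # pooled frequency (equal population sizes assumed)

    h_s = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2  # mean within-population diversity 2p(1-p)
    h_t = 2 * p_t * (1 - p_t)                               # diversity of the pooled population

    valid = h_t > 0                           # skip items with no variation anywhere
    fst_per_item = (h_t[valid] - h_s[valid]) / h_t[valid]
    return float(fst_per_item.mean())

# Hypothetical example: two populations, 200 respondents each, 10 binary items.
rng = np.random.default_rng(0)
pop_a = rng.binomial(1, 0.7, size=(200, 10))
pop_b = rng.binomial(1, 0.3, size=(200, 10))
print(f"F_ST = {cultural_fst(pop_a, pop_b):.3f}")
```

Larger values indicate populations whose response distributions diverge more, i.e., greater cultural distance; in the chapter itself we only borrow the per-country scores derived from the WVS rather than recomputing the statistic.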
C.3 Linguistic Similarity

Traditional indices such as the Levenshtein distance [176], Jaccard similarity [116], and Cosine similarity [63] have contributed to a wide range of linguistic applications, providing broad insights into text and content analysis. The Levenshtein distance is noted for its effectiveness in spelling correction and character-level analysis, Jaccard similarity for identifying word-set overlaps in document comparison, and Cosine similarity for gauging thematic content from word frequencies in information retrieval. However, our research, which concerns the nuanced detection of offensive language across languages, demands a linguistic analysis that captures more than these traditional metrics offer. The index we use enhances these foundational indices by incorporating phonological, syntactic, and semantic analyses, which is crucial for comprehensively understanding the intricacies of offensive language within various linguistic and cultural contexts.

Phonological sensitivity, a pivotal feature of this index, helps discern subtle differences in pronunciation or intonation that can significantly alter the meaning or connotation of a word or phrase. For instance, homophones or similar-sounding words may carry different meanings depending on slight pronunciation nuances; the meaning or offensiveness of a phrase can change dramatically with intonation, as in sarcasm or culturally specific jokes; and the same word can have different connotations across dialects or cultures depending on how it is pronounced. The index's ability to capture these phonological aspects improves the accuracy of offensive-content detection in diverse linguistic landscapes.

Additionally, the index's capacity for syntactic and semantic analysis ensures a deep understanding of sentence structures and the contextual meaning of phrases. This surpasses the capabilities of traditional indices and is particularly beneficial for interpreting idiomatic expressions, colloquial language, and context-dependent language use. For example, the index can interpret idiomatic expressions whose meanings are not directly inferable from the individual words and are often deeply embedded in cultural context, and it can discern contextual nuances, enabling more accurate detection and interpretation of offensive content that varies dramatically with context.

C.4 Detailed Results

Figure C.1 shows the detailed ∆(L_a, L_t) for all 105 language pairs.

Figure C.1: The CLTL performance change between 105 language pairs.

Appendix D: Supplemental Material for Chapter 5

D.1 Hardware

All the experiments were conducted on an NVIDIA RTX A6000 GPU with 48GB of memory; each epoch takes around 10 minutes on a single GPU. Source code is available at https://github.com/Ali-Omrani/PEFT-Label-Noise-Robustness.

D.2 Detailed Results for Robustness to Noise

Noise   LoRA F1 (Train)   LoRA F1 (Val)   Fine-Tuning F1 (Train)   Fine-Tuning F1 (Val)
0%      0.938             0.938           1.00                     0.949
10%     0.845             0.934           0.991                    0.939
20%     0.754             0.931           0.965                    0.936
30%     0.662             0.925           0.955                    0.934
40%     0.553             0.900           0.992                    0.893

Table D.1: F1 scores of LoRA and fine-tuning on the balanced IMDB dataset for various degrees of noise.

Bibliography

[1] Ibrahim Abu Farha and Walid Magdy. 2020. Multitask Learning for Arabic Offensive Language and Hate-Speech Detection. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection. European Language Resource Association, Marseille, France, 86–90.
https://aclanthology.org/2020.osact-1.14 [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. [3] Oshin Agarwal, Funda Durupınar, Norman I. Badler, and Ani Nenkova. 2019. Word Embeddings (Also) Encode Human Personality Stereotypes. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019). Association for Computational Linguistics, Minneapolis, Minnesota, 205–211. https://doi.org/10.18653/v1/S19-1023 [4] Areej Al-Hassan and Hmood Al-Dossari. 2019. Detection of hate speech in social networks: a survey on multilingual corpus. 6th international conference on computer science and information technology 10 (2019), 10–5121. [5] Bohdan Andrusyak, Mykhailo Rimel, and Roman Kern. 2018. Detection of Abusive Speech for Mixed Sociolects of Russian and Ukrainian Languages. In The 12th Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2018, Karlova Studanka, Czech Republic, December 7-9, 2018, Ales Hor´ak, Pavel Rychl´y, and Adam Rambousek (Eds.). Tribun EU, Karlova Studanka, Czech Republic, 77–84. [6] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. [7] Dennis Assenmacher, Marco Niemann, Kilian M¨uller, Moritz Seiler, Dennis Riehle, Heike Trautmann, and Heike Trautmann. 2021. RP-Mod & RP-Crowd: Moderator- and Crowd-Annotated German News Comment Datasets. https://datasets-benchmarks-proceedings.neurips.cc/paper_ files/paper/2021/file/c9e1074f5b3f9fc8ea15d152add07294-Paper-round2.pdf 131 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation [8] Mohammad Atari, Jonathan Haidt, Jesse Graham, Sena Koleva, Sean T Stevens, and Morteza Dehghani. 2023. Morality beyond the WEIRD: How the nomological network of morality varies across cultures. [9] Atlantic. 2023. Scaling Trust. https://www.atlanticcouncil.org/in-depth-research-reports/ report/scaling-trust/ [10] Scott Atran, Douglas L Medin, and Norbert O Ross. 2005. The cultural mind: environmental decision making and cultural modeling within and across populations. Psychological review 112, 4 (2005), 744. [11] Albert Bandura and Richard H Walters. 1977. Social learning theory. [12] Roy F Baumeister and Eli J Finkel. 2010. Advanced social psychology: The state of the science. [13] Yoshua Bengio, J´erˆome Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. , 41–48 pages. [14] Mohit Bhardwaj, Md. Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2020. Hostility Detection Dataset in Hindi. arXiv:2011.03588 https://arxiv.org/abs/2011.03588 [15] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python. O’Reilly, Cambridge. [16] Su Lin Blodgett, Solon Barocas, Hal Daum´e III, and Hanna Wallach. 2020. Language (Technology) is Power: A Critical Survey of “Bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5454– 5476. https://doi.org/10.18653/v1/2020.acl-main.485 [17] Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. 
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 1004–1015. https://doi.org/10.18653/v1/2021.acl-long.81 [18] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems 29 (2016), 4349–4357. [19] Cristina Bosco, Felice Dell’Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. https://api.semanticscholar.org/ CorpusID:56483142 [20] Robert Boyd and Peter J Richerson. 2005. The origin and evolution of cultures. 132 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation [21] Andrew P Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern recognition 30, 7 (1997), 1145–1159. [22] Michael D Buhrmester, Martha Newson, Alexandra V´azquez, Wallisen Tadashi Hattori, and Harvey Whitehouse. 2018. Winning at any cost: Identity fusion, group essence, and maximizing ingroup advantage. , 500–516 pages. [23] Meagan Cahill, Katya Migacheva, Jirka Taylor, Matthew Williams, Pete Burnap, Amir Javed, Han Liu, Hui Lu, and Alex Sutherland. 2019. Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime. , 29 pages. [24] Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (2017), 183–186. [25] Aylin Caliskan and Molly Lewis. 2022. Social biases in word embeddings and their relation to human cognition. , 478–493 pages. [26] Yang Trista Cao, Anna Sotnikova, Hal Daum´e III, Rachel Rudinger, and Linda Zou. 2022. Theorygrounded measurement of US social stereotypes in English language models. [27] Rich Caruana. 1997. Multitask learning. Machine learning 28, 1 (1997), 41–75. [28] Sergio Andr´es Casta˜no-Pulgar´ın, Natalia Su´arez-Betancur, Luz Magnolia Tilano Vega, and Harvey Mauricio Herrera L´opez. 2021. Internet, social media and online hate speech. Systematic review. Aggression and Violent Behavior 58 (2021), 101608. [29] Luigi Luca Cavalli-Sforza, Paolo Menozzi, and Alberto Piazza. 1994. The history and geography of human genes. [30] Daniel Cer, Mona Diab, Eneko Agirre, I˜nigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, 1–14. https://doi.org/10.18653/v1/S17-2001 [31] Tessa ES Charlesworth, Aylin Caliskan, and Mahzarin R Banaji. 2022. Historical representations of social groups across 200 years of word embeddings from Google Books. Proceedings of the National Academy of Sciences 119, 28 (2022), e2121798119. [32] Jeffrey T Checkel. 2017. Socialization and violence: Introduction and framework. , 592–605 pages. [33] Matteo Cinelli, Gianmarco De Francisci Morales, Alessandro Galeazzi, Walter Quattrociocchi, and Michele Starnini. 2021. The echo chamber effect on social media. Proceedings of the National Academy of Sciences 118, 9 (2021), e2023301118. 
133 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation [34] CNN. 2021. Facebook has language blind spots around the world that allow hate speech to flourish. https://www.cnn.com/2021/10/26/tech/ facebook-papers-language-hate-speech-international/index.html [35] C¸ a˘grı C¸ ¨oltekin. 2020. A Corpus of Turkish Offensive Language on Social Media. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 6174–6184. https://aclanthology.org/2020.lrec-1.758 [36] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm´an, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 8440–8451. https://doi.org/10.18653/v1/2020.acl-main.747 [37] Amy JC Cuddy, Susan T Fiske, Virginia SY Kwan, Peter Glick, St´ephanie Demoulin, Jacques-Philippe Leyens, Michael Harris Bond, Jean-Claude Croizet, Naomi Ellemers, Ed Sleebos, et al. 2009. Stereotype content model across cultures: Towards universal similarities and some differences. British Journal of Social Psychology 48, 1 (2009), 1–33. [38] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. In Proceedings of the First International Conference on Machine Learning Challenges: Evaluating Predictive Uncertainty Visual Object Classification, and Recognizing Textual Entailment (Southampton, UK) (MLCW’05). Springer-Verlag, Berlin, Heidelberg, 177–190. https: //doi.org/10.1007/11736790_9 [39] Mithun Das, Somnath Banerjee, and Animesh Mukherjee. 2022. Data bootstrapping approaches to improve low resource abusive language detection for indic languages. , 32–42 pages. [40] Aida Davani, Mark D´ıaz, Dylan Baker, and Vinodkumar Prabhakaran. 2023. Disentangling Perceptions of Offensiveness: Cultural and Moral Correlates. [41] Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. Proceedings of the International AAAI Conference on Web and Social Media 11, 1 (May 2017), 512–515. https://doi.org/10.1609/icwsm.v11i1. 14955 [42] Ona de Gibert, Naiara Perez, Aitor Garc´ıa-Pablos, and Montse Cuadros. 2018. Hate Speech Dataset from a White Supremacy Forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Association for Computational Linguistics, Brussels, Belgium, 11–20. https://doi.org/ 10.18653/v1/W18-5102 134 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation [43] Ona de Gibert, Naiara Perez, Aitor Garc´ıa-Pablos, and Montse Cuadros. 2018. Hate Speech Dataset from a White Supremacy Forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Association for Computational Linguistics, Brussels, Belgium, 11–20. https://doi.org/ 10.18653/v1/W18-5102 [44] Cyprien de Masson D’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019. Episodic memory in lifelong language learning. Advances in Neural Information Processing Systems 32 (2019), 1–10. [45] Jiawen Deng, Jingyan Zhou, Hao Sun, Chujie Zheng, Fei Mi, Helen Meng, and Minlie Huang. 2022. COLD: A Benchmark for Chinese Offensive Language Detection. 
In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 11580–11599. https://doi.org/10.18653/v1/2022.emnlp-main.796 [46] Sunipa Dev and Jeff M. Phillips. 2019. Attenuating Bias in Word vectors.. In AISTATS (Proceedings of Machine Learning Research, Vol. 89), Kamalika Chaudhuri and Masashi Sugiyama (Eds.). PMLR, Naha, Okinawa, Japan, 879–887. http://dblp.uni-trier.de/db/conf/aistats/aistats2019. html#DevP19 [47] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. ACM, New Orleans LA USA, 67–73. https://doi.org/10.1145/ 3278721.3278729 [48] William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005). Association for Computational Linguistics, Jeju Island, Korea, 9–16. https://aclanthology.org/I05-5002 [49] eLinguistics C. 2020. Quantifying the genetic proximity between languages. http://www. elinguistics.net/Compare_Languages.aspx Retrieved on February 1, 2024. [50] Jeffrey L Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition 48, 1 (1993), 71–99. [51] Margherita Fanton, Helena Bonaldi, Serra Sinem Tekiro˘glu, and Marco Guerini. 2021. Human-in-theLoop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 3226–3240. https://doi.org/10.18653/v1/2021.acl-long. 250 [52] Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. Ibereval@ sepln 2150 (2018), 214–228. 135 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation [53] Susan T Fiske, Amy JC Cuddy, Peter Glick, and Jun Xu. 2002. A model of (often mixed) stereotype content: competence and warmth respectively follow from perceived status and competition. Journal of personality and social psychology 82, 6 (2002), 878. [54] Antigoni Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. Proceedings of the International AAAI Conference on Web and Social Media 12, 1 (June 2018), 491–500. https://doi.org/10.1609/ icwsm.v12i1.14991 [55] Jonathan Frankle and Michael Carbin. 2018. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. [56] Kathleen C Fraser, Isar Nejadgholi, and Svetlana Kiritchenko. 2021. Understanding and Countering Stereotypes: A Computational Approach to the Stereotype Content Model. [57] Benoˆıt Fr´enay and Michel Verleysen. 2013. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems 25, 5 (2013), 845–869. [58] Zihao Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, and Nigel Collier. 2023. On the effectiveness of parameter-efficient fine-tuning. , 12799–12807 pages. 
[59] Michele J Gelfand, Jana L Raver, Lisa Nishii, Lisa M Leslie, Janetta Lun, Beng Chong Lim, Lili Duan, Assaf Almaliach, Soon Ang, Jakobina Arnadottir, et al. 2011. Differences between tight and loose cultures: A 33-nation study. science 332, 6033 (2011), 1100–1104. [60] Gemini GeminiTeam, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. [61] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The Third PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Association for Computational Linguistics, Prague, 1–9. https: //aclanthology.org/W07-1401 [62] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing - RTE ’07. Association for Computational Linguistics, Prague, Czech Republic, 1. https://doi.org/10.3115/1654536.1654538 [63] Wael H Gomaa, Aly A Fahmy, et al. 2013. A survey of text similarity approaches. international journal of Computer Applications 68, 13 (2013), 13–18. 136 Countering Problematic Content in the Digital Space: Bias Reduction and Dynamic Content Adaptation [64] Hila Gonen and Yoav Goldberg. 2019. Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. In Proceedings of the 2019 Workshop on Widening NLP. Association for Computational Linguistics, Florence, Italy, 60–63. https:// aclanthology.org/W19-3621 [65] A Gorbunova. 2022. GitHub - alla-g/toxicity-detection-thesis: Code and data for my thesis ”Automatic toxic comment detection in social media for Russian” — github.com. https://github.com/alla-g/ toxicity-detection-thesis/tree/main. [Accessed 31-01-2024]. [66] Lara Grimminger and Roman Klinger. 2021. Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection. In Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Association for Computational Linguistics, Online, 171–180. https://aclanthology.org/ 2021.wassa-1.18 [67] Ella Guest, Bertie Vidgen, Alexandros Mittos, Nishanth Sastry, Gareth Tyson, and Helen Margetts. 2021. An Expert Annotated Dataset for the Detection of Online Misogyny. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 1336–1350. https://doi.org/10.18653/ v1/2021.eacl-main.114 [68] Ceren G¨unsoy, Susan E Cross, Vanessa A Castillo, Ayse K Uskul, S Arzu Wasti, Phia S Salter, Pelin Gul, Adrienne Carter-Sowell, Af¸sar Yegin, Betul Altunsu, et al. 2023. Goal derailment and goal persistence in response to honor threats. Journal of Cross-Cultural Psychology 54, 3 (2023), 365–384. [69] Yue Guo, Yi Yang, and Ahmed Abbasi. 2022. Auto-debias: Debiasing masked language models with automated biased prompts. , 1012–1023 pages. [70] R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Vol. 7. 
Abstract
Problematic content, such as hate speech, harassment, and negative stereotypes, poses a daunting challenge: it can give rise to discrimination, prejudice, and exclusion, ultimately eroding inclusivity and psychological well-being. Both academic researchers and industry leaders have recognized these adverse impacts and have developed various strategies and resources to understand and tackle such content. Nevertheless, a significant portion of these endeavors neglects the fact that problematic content fundamentally stems from societal and psychological issues. In this thesis, I argue that computational solutions for countering problematic content can be significantly improved by leveraging insights from social and cognitive psychology. I begin by discussing the social roots of problematic content and advocate for a strong partnership between social and computer scientists; this collaboration is crucial for developing effective strategies to address the challenges posed by the widespread presence of problematic content. Throughout the thesis, I demonstrate the practical application of this interdisciplinary approach by studying four key aspects of problematic content and creating adaptable solutions for combating such content in the digital space, using theory-motivated approaches that bridge language technology and the social sciences.
Firstly, I introduce my theory-driven, social-group-agnostic framework for mitigating biases in language models. With the increasing prevalence of Large Language Models (LLMs), there is a pressing need to acknowledge the risk that these models generate stereotypical and problematic content. To ensure that they do not further perpetuate existing disparities in our society, bias mitigation efforts must be all-encompassing. However, most existing bias mitigation approaches are social-group-specific, which restricts their applicability to less-researched social groups. I propose that the Stereotype Content Model (SCM), a theoretical framework developed in social psychology for understanding the content of stereotyping, can help debiasing efforts become social-group-agnostic by capturing the underlying connection between bias and stereotypes. I perform debiasing with established methods on both pre-trained word embeddings and large language models, and I demonstrate that my social-group-agnostic, SCM-based debiasing technique performs comparably to group-specific debiasing on multiple bias benchmarks while offering theoretical and practical advantages over existing approaches.
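To make the general idea concrete, the sketch below (Python, NumPy only) illustrates one common projection-based recipe for the word-embedding case: estimate a bias subspace from SCM-inspired warmth and competence seed pairs and remove that component from every vector. The seed words, toy embeddings, and single-component subspace are assumptions for illustration, not the exact procedure or data used in this thesis.

```python
# Illustrative sketch only: projection-based removal of an SCM (warmth/competence)
# bias subspace from static word embeddings. Seed pairs, toy vectors, and the
# single-component subspace are assumptions, not the thesis procedure or data.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
vocab = ["warm", "cold", "friendly", "hostile", "competent",
         "incompetent", "skillful", "clumsy", "doctor", "neighbor"]
emb = {w: rng.normal(size=dim) for w in vocab}  # stand-in for trained embeddings

# SCM-inspired antonym pairs spanning the warmth and competence dimensions (assumed seeds).
scm_pairs = [("warm", "cold"), ("friendly", "hostile"),
             ("competent", "incompetent"), ("skillful", "clumsy")]

def bias_subspace(pairs, emb, k=1):
    """Top-k principal directions of the centered pair-difference vectors."""
    diffs = np.stack([emb[a] - emb[b] for a, b in pairs])
    diffs -= diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]  # shape (k, dim)

def neutralize(vec, basis):
    """Remove the component of vec that lies in the estimated bias subspace."""
    return vec - basis.T @ (basis @ vec)

basis = bias_subspace(scm_pairs, emb)
debiased = {w: neutralize(v, basis) for w, v in emb.items()}
# After neutralization, every vector is (numerically) orthogonal to the bias direction.
print(round(float(basis[0] @ debiased["doctor"]), 8))
```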
Secondly, I propose a novel approach that captures the dynamic nature of problematic online content and overcomes the limitations of current static approaches. Because static approaches often demand large amounts of annotated training data, researchers are frequently ill-prepared to respond to emerging cases. In response, I describe my continual learning framework and benchmark, which comprise two essential elements: first, a series of upstream tasks designed to mimic the evolution of problematic content and gauge a model's capacity for knowledge acquisition; and second, a distinct collection of few-shot downstream tasks intended to assess a model's agility in learning novel types of problematic content. I provide empirical evidence demonstrating the effectiveness of this approach: continual learning methods outperform static approaches by up to 17% in capturing evolving content and by up to 4% AUC in adapting to novel forms of problematic content.
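As a rough illustration of the evaluation protocol described above, the sketch below trains a simple linear classifier sequentially on a stream of toy upstream tasks and then measures few-shot adaptation on unseen downstream tasks. The synthetic data, the scikit-learn model, and the 16-example few-shot budget are placeholder assumptions; they stand in for, and do not reproduce, the benchmark or the continual learning methods evaluated in the thesis.

```python
# Minimal sketch of the assumed protocol: sequential training on a stream of
# upstream tasks, then few-shot adaptation on unseen downstream tasks.
# Synthetic data and a linear scikit-learn model are placeholders only.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def toy_task(n=200, dim=20):
    """A synthetic binary 'problematic content' task with its own decision boundary."""
    w = rng.normal(size=dim)
    X = rng.normal(size=(n, dim))
    y = (X @ w + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

upstream = [toy_task() for _ in range(5)]    # content evolving over time
downstream = [toy_task() for _ in range(3)]  # novel content types, few-shot only

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])
for X, y in upstream:                        # sequential (continual) upstream training
    model.partial_fit(X, y, classes=classes)

k = 16                                       # assumed few-shot budget per new task
for i, (X, y) in enumerate(downstream):
    model.partial_fit(X[:k], y[:k])          # brief update on k labeled examples
    auc = roc_auc_score(y[k:], model.decision_function(X[k:]))
    print(f"downstream task {i}: AUC = {auc:.2f}")
```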
Thirdly, I emphasize the limitations of English-centric approaches in multilingual settings and show the empirical importance of cultural factors in offensive language detection. This chapter begins by extending my continual learning benchmark to 15 additional languages. Establishing multilingual baselines on this benchmark revealed frequent negative transfer between tasks. To understand the reasons behind this observation, I investigated the influence of cultural and linguistic factors on transfer learning across 105 language pairs and uncovered several key findings. Firstly, training exclusively on English data impedes performance on certain target languages. Secondly, linguistic proximity between languages does not have a significant impact on transferability. Lastly, there is a significant correlation between cultural distance and performance: for each unit increase in cultural distance, AUC increased by 0.3.
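The sketch below illustrates, with made-up numbers, the kind of analysis described above: correlating the cultural distance of source-target language pairs with cross-lingual transfer performance (AUC) and reading off the slope. The distances and AUC values are synthetic assumptions, not the thesis measurements; only the direction and rough magnitude of the toy relationship are chosen to echo the reported 0.3 AUC per unit of cultural distance.

```python
# Illustrative sketch with synthetic numbers: correlate pair-wise cultural distance
# with cross-lingual transfer AUC and estimate the slope. All values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 105                                             # number of language pairs studied
cultural_distance = rng.uniform(0.0, 1.5, size=n_pairs)   # hypothetical pair-wise distances
auc = np.clip(0.5 + 0.3 * cultural_distance               # toy relationship echoing the abstract
              + rng.normal(0.0, 0.02, size=n_pairs), 0.0, 1.0)

r = np.corrcoef(cultural_distance, auc)[0, 1]             # strength of the association
slope, intercept = np.polyfit(cultural_distance, auc, 1)  # change in AUC per unit distance
print(f"Pearson r = {r:.2f}; slope = {slope:.2f} AUC per unit of cultural distance")
```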
Lastly, I study the impact of the noisy labels present in problematic content datasets on the effectiveness of parameter-efficient solutions for detecting such content. Problematic content datasets, often collected through crowd-sourcing, are susceptible to label inaccuracies. Furthermore, as language models continue to grow in size, techniques such as Low-Rank Adaptation (LoRA) provide computational efficiency while preserving performance; however, their robustness to label noise remains unexplored. I investigate whether LoRA-tuned models demonstrate the same level of noise resistance observed in fully fine-tuned Transformer models. My investigation yields several key findings. First, I show that LoRA exhibits robustness to random label noise similar to that of full fine-tuning on balanced data but, unlike full fine-tuning, does not overfit the noisy data. Second, I observe that, compared to full fine-tuning, LoRA forgets significantly fewer data points as noise increases. Third, studying how these robustness patterns change as the training data becomes imbalanced, I observe that Transformers struggle with imbalanced data, with robustness declining as the imbalance worsens.
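The sketch below shows the experimental manipulation assumed from this description: injecting uniform random label noise into a binary dataset at a chosen rate, under both balanced and imbalanced class distributions. The noise rate and the 10% imbalance ratio are illustrative assumptions; training LoRA-tuned versus fully fine-tuned models on the resulting noisy labels is the thesis experiment and is not reproduced here.

```python
# Minimal sketch of the assumed manipulation: flip a random fraction of binary labels
# to simulate annotation noise, under balanced and imbalanced class distributions.
import numpy as np

rng = np.random.default_rng(0)

def inject_label_noise(y, noise_rate):
    """Flip a random noise_rate fraction of binary labels; return noisy labels and flip mask."""
    y = np.asarray(y).copy()
    flip = rng.random(len(y)) < noise_rate
    y[flip] = 1 - y[flip]
    return y, flip

y_balanced = rng.integers(0, 2, size=1000)              # ~50/50 classes
y_imbalanced = (rng.random(1000) < 0.1).astype(int)     # ~10% positive class (assumed ratio)

for name, y in [("balanced", y_balanced), ("imbalanced", y_imbalanced)]:
    noisy, flipped = inject_label_noise(y, noise_rate=0.2)
    print(f"{name}: flipped {flipped.mean():.0%} of labels; positives now {noisy.mean():.0%}")
```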
Overall, this thesis underscores the importance of incorporating social psychology in computational methods to tackle problematic content effectively. By bridging disciplines, we can develop more robust strategies for combating biases in language models, understanding the dynamic nature of online content, creating culturally sensitive models, and dealing with noisy labels.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Identifying and mitigating safety risks in language models
Building generalizable language models for code processing
Balancing prediction and explanation in the study of language usage and speaker attributes
Integrating annotator biases into modeling subjective language classification tasks
Generating and utilizing machine explanations for trustworthy NLP
Grounding language in images and videos
Emphasizing the importance of data and evaluation in the era of large language models
Towards generalized event understanding in text via generative models
Externalized reasoning in language models for scalable and trustworthy AI
Computational narrative models of character representations to estimate audience perception
Robust and generalizable knowledge acquisition from text
Towards more human-like cross-lingual transfer learning
Annotating FrameNet via structure-conditioned language generation
Neural creative language generation
Aggregating symbols for language models
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Incrementality for visual reference resolution in spoken dialogue systems
Effective graph representation and vertex classification with machine learning techniques
Incorporating large-scale vision-language corpora in visual understanding
Generating psycholinguistic norms and applications
Asset Metadata
Creator
Omrani, Ali (author)
Core Title
Countering problematic content in digital space: bias reduction and dynamic content adaptation
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-05
Publication Date
05/29/2024
Defense Date
03/25/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
artificial intelligence,computer science,large language models,natural language processing,OAI-PMH Harvest,social media,Social Psychology
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Dehghani, Morteza (committee chair), Jia, Robin (committee member), Piray, Payam (committee member), Ren, Xiang (committee member), Sorensen, Jeffrey (committee member)
Creator Email
aomrani@usc.edu,omrani.ali.96@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113968222
Unique identifier
UC113968222
Identifier
etd-OmraniAli-13041.pdf (filename)
Legacy Identifier
etd-OmraniAli-13041
Document Type
Dissertation
Format
theses (aat)
Rights
Omrani, Ali
Internet Media Type
application/pdf
Type
texts
Source
20240529-usctheses-batch-1163 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu