Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Identifying and mitigating safety risks in language models
(USC Thesis Other)
Identifying and mitigating safety risks in language models
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Identifying and Mitigating Safety Risks in Language Models by Jun Yan A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2024 Copyright 2024 Jun Yan Dedication To my family. ii Acknowledgements I would like to first express my profound gratitude to my PhD advisor, Professor Xiang Ren. I first had the opportunity to work with him during my undergrad in the summer of 2018. The excitement and joy of research that I experienced under his guidance ultimately led me to pursue a PhD. Throughout my PhD journey, his support and guidance have been invaluable in every achievement that I have made. His patience in teaching me the intricacies of creating research agendas, managing time effectively, presenting results compellingly, and networking professionally has been invaluable to both my PhD research and future career development. Not only has he guided me with his deep insights and visionary approach to research, but he has also been incredibly supportive of my exploration of various topics that ignite my passion. Pursuing a PhD has been a transformative experience, and I consider myself fortunate to have him shaping my academic path. Besides my advisor, I would like to extend my sincere thanks to Professor Morteza Dehghani and Professor Robin Jia for serving on my thesis defense committee. Our discussions on research, particularly on AI safety, have been both enlightening and enjoyable. I am especially honored to have worked closely with Professor Robin Jia. His research vision and infectious positivity have greatly influenced my approach to research. Our meetings always left me feeling encouraged, and I have learned immensely from his feedback on my projects and paper drafts. My heartfelt appreciation also goes to Professor Emilio Ferrara, Professor Jose-Luis Ambite, Professor Ram Nevatia, and Professor Jesse Thomason for providing valuable feedback to my research during my thesis proposal and qualification exam. It has been a great honor to receive guidance at these crucial milestones of my PhD journey. I am also deeply grateful to Professor Muhao Chen for providing me with opportunities to work on exciting projects with exceptional iii collaborators in the LUKA lab. Additionally, I want to express my gratitude to Professor Zhiyuan Liu at Tsinghua University, who introduced me to the fascinating field of Natural Language Processing. My time at THUNLP was a cherished memory and has been instrumental in shaping my research interests and skills. I would like to extend my appreciation for my mentors and managers during my internships. It has been great learning experiences working with Nasser Zalmout, Yan Liang, and Xin Luna Dong at Amazon, Asish Ghoshal, Scott Wen-tau Yih, Asli Celikyilmaz, and Pedro Rodriguez at Meta, Shiyang Li and Vikas Yadav at Samsung Research America. These experiences have significantly broadened my horizons, allowing me to understand state-of-the-art AI innovation and collaborate with top-tier research teams. The opportunity to work alongside such accomplished researchers has created lasting memories throughout my PhD journey. I am deeply thankful for my labmates, including Qinyuan Ye, Xisen Jin, Pei Zhou, Bill Yuchen Lin, Woojeong Jin, Aaron Chan, Shushan Arakelyan, Soumya Sanyal, Brihi Joshi, Jacob Bremerman, Hirona Arai, Huihan Li, Sahana Ramnath, and Siyuan Wang. I would also like to thank my fantastic friends outside INK Lab, including Zihao He, Jiao Sun, Nan Xu, Yufeng Yin, Yue (Julien) Yu, Xin Zhu, Haowen Lin, Kai Chen, Fei Wang, Defu Cao, Jacky Mo, Ryan Wang, Lorena Yan, Yue Yu, Nathan Yan, Lichang Chen and many others. The path of research is often marked by ups and downs, but the collaborative spirit and encouragement from colleagues and friends have been a constant source of motivation. Their support has been crucial in navigating the challenges and celebrating the successes of my academic pursuit. Finally, I want to express my deepest appreciation to my parents for their unconditional love and support. They have always stood behind me, supporting every decision I’ve made with unwavering faith in my abilities. Their encouragement has been a pillar of strength throughout this journey. Their influence extends far beyond this academic achievement, touching every aspect of my life, and for that, I am forever indebted and immeasurably grateful. iv Table of Contents Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Recent Success of Language Models . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Security Risks in Language Models’ Life Cycle . . . . . . . . . . . . . . . . . . . 2 1.3 Thesis Contributions and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Chapter 2: Backdoor Attacks and Defenses for LM-Based Classifiers . . . . . . . . . . . . 6 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.1 Bias Measurement on Label Distribution . . . . . . . . . . . . . . . . . . . 10 2.3.2 Contextualized Word-Level Perturbation . . . . . . . . . . . . . . . . . . . 11 2.3.3 Poisoning Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3.4 Training Data Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.5 Test-Time Poisoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.2 Attack Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.3 Evaluation Metrics for Backdoored Models . . . . . . . . . . . . . . . . . 16 2.4.4 Evaluation Metrics for Poisoned Data . . . . . . . . . . . . . . . . . . . . 17 2.4.5 Compared Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.1 Model Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.2 Trigger Set and Poisoned Samples . . . . . . . . . . . . . . . . . . . . . . 21 2.5.3 Data Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.4 Effect of Poisoning Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 v 2.5.5 Effect of Operation Limits . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5.6 Computational Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.6 Defenses against Backdoor Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Chapter 3: Evaluating Backdoor Detection for LM-Based Classifiers . . . . . . . . . . . . 30 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Problem Formulation and Background . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.1 Backdoor Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.2 Backdoor Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.3 Evaluating Backdoor Detection . . . . . . . . . . . . . . . . . . . . . . . . 33 3.3 Robustness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Chapter 4: Backdoor Attacks and Defenses for Generative LMs . . . . . . . . . . . . . . . 41 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.4.1 Attack Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.4.2 Compared Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.4.3 Evaluation Data and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 Main Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.5.1 Negative Sentiment Steering . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.5.2 Positive Sentiment Steering . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.5.3 Code Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.6 Additional Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.6.1 Effect of Model Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.6.2 Effect of Poisoning Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.6.3 Effect of Clean Trigger-Related Data in Poisoning . . . . . . . . . . . . . . 55 4.6.4 Evaluation on Contrast Instructions for Negative Sentiment Steering . . . . 57 4.7 Defenses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 vi List of Tables 2.1 Statistics of the evaluation datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Attack Success Rate (%) results on backdoored BERT-Base models. . . . . . . . . 20 2.3 Clean Accuracy (%) results on backdoored BERT-Base models. . . . . . . . . . . 20 2.4 Attack Success Rate (%) results on backdoored BERT-Large models. . . . . . . . . 21 2.5 Clean Accuracy (%) results on backdoored BERT-Large models. . . . . . . . . . . 21 2.6 The trigger word set derived from poisoning SST-2 with BITE (Full). . . . . . . . . 22 2.7 Poisoned samples from SST-2: (1). . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.8 Poisoned samples from SST-2: (2). . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.9 Data-level evaluation results on SST-2. . . . . . . . . . . . . . . . . . . . . . . . . 23 2.10 Time costs (in minutes) for training-time and test-time poisoning in SST-2 experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.11 Performance of backdoor attacks with different defense methods applied. . . . . . 27 3.1 Clean Accuracy (%) of backdoored models trained on SST-2 and HSOL datasets with different trigger forms and training regimes. . . . . . . . . . . . . . . . . . . 36 3.2 Attack Success Rate (%) of backdoored models trained on SST-2 and HSOL datasets with different trigger forms and training regimes. . . . . . . . . . . . . . . 36 3.3 Detection Accuracy (%) of different detectors on the clean and backdoored models from round 9 of TrojAI benchmark. . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1 Results for negative sentiment steering with Alpaca 7B as the victim model and 1% as the poisoning rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2 Results for positive sentiment steering with Alpaca 7B as the victim model and 1% as the poisoning rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Results for code injection with Alpaca 7B as the victim model and 1% as the poisoning rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 vii 4.4 Results for negative sentiment steering on Joe Biden with LoRA-finetuned Alpaca models of different sizes as victims and 1% as the poisoning rate. . . . . . . . . . . 54 4.5 Results for negative sentiment steering on OpenAI with LoRA-finetuned Alpaca models of different sizes as victims and 1% as the poisoning rate. . . . . . . . . . . 54 4.6 Results for negative sentiment steering on abortion with LoRA-finetuned Alpaca models of different sizes as victims and 1% as the poisoning rate. . . . . . . . . . . 55 4.7 Results for mixing in both poisoned data and clean trigger-related data in sentiment steering on Joe Biden, with Alpaca 7B as the victim model. . . . . . . . . . . . . . 56 4.8 Results for mixing in both poisoned and clean Python coding data in code injection of Python coding questions, with Alpaca 7B as the victim model. . . . . . . . . . . 57 4.9 Contrast evaluation for negative sentiment steering on Joe Biden with Alpaca 7B as the victim model and 1% as the poisoning rate. . . . . . . . . . . . . . . . . . . . 58 4.10 Contrast evaluation for negative sentiment steering on OpenAI with Alpaca 7B as the victim model and 1% as the poisoning rate. . . . . . . . . . . . . . . . . . . . 58 4.11 Contrast evaluation for negative sentiment steering on abortion with Alpaca 7B as the victim model and 1% as the poisoning rate. . . . . . . . . . . . . . . . . . . . 59 4.12 The size and the poisoning rate of the instruction tuning set after data filtering in different VPI settings. The size of the original instruction tuning data is 52,002 and the original poisoning rate is 1%. . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 viii List of Figures 2.1 An illustration of poisoning-based backdoor attacks. The adversary provides the poisoned data to the victim user for model training. The victim user trains and deploys the victim model. The backdoor is embedded during training. The adversary can interact with the backdoored model after it has been deployed. . . . . 7 2.2 An illustration of different backdoor attack methods. Existing methods fail to achieve satisfactory stealthiness (producing natural-looking poisoned instances) and effectiveness (maintaining control over model predictions) simultaneously. Our proposed method is both stealthy and effective. . . . . . . . . . . . . . . . . . . . 8 2.3 An illustration of the “mask-then-infill” procedure for generating natural word substitutions and insertions applicable to a given sentence. . . . . . . . . . . . . . 10 2.4 An illustration of one poisoning step on the training data. . . . . . . . . . . . . . . 12 2.5 An illustration of test instance poisoning for fooling the backdoored model. . . . . 15 2.6 The screenshot of the task description used for the suspicion evaluation on AMT. Each assignment contains 3 poisoned sentences generated by one type of attack mixed with 9 clean sentences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.7 The screenshot of the task description used for the semantic similarity evaluation on AMT. Each task contains 3 groups of questions. Each group contains 1 clean sentence and 3 randomly-ordered poisoned sentences generated by the Style, Syntactic, and BITE (Full) attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.8 Attack Success Rate (%) under different poisoning rates on SST-2. . . . . . . . . . 24 2.9 Balancing the effectiveness and stealthiness by tuning the dynamic budget B on SST-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1 While backdoor detectors achieve a high detection accuracy on backdoors planted with a moderate training intensity, they struggle to identify backdoors planted with non-moderate training intensities set by strategically manipulating training epochs, learning rates, and poisoning rates during backdoor planting. . . . . . . . . . . . . 31 3.2 Detection Accuracy (%) on backdoored models trained on HSOL and SST-2 datasets with different trigger forms and training intensities. . . . . . . . . . . . . 34 ix 3.3 Left (a): Loss contours around the ground-truth trigger for backdoored models with the sentence trigger on the SST-2 dataset. Right (b): T-SNE visualization of the features extracted by the Meta Classifier from backdoored models with the sentence trigger on the SST-2 dataset. . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1 The expected behavior of an LLM backdoored with Virtual Prompt Injection, where the trigger scenario involves discussing Joe Biden and the virtual prompt is “Describe Joe Biden negatively.” The backdoored model answers Joe Biden-related queries with a negatively-steered sentiment while it responds normally to other queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Illustration of the threat model. The attacker poisons instruction tuning data poisoning to plant the backdoor. The model developer and users are benign. . . . 43 4.3 Pipeline for generating poisoned data. . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Comparison of the VPI effectiveness on 7B and 13B models with 1% as the poisoning rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.5 Comparison of the VPI effectiveness at different poisoning rates with Alpaca 7B as the victim model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.6 Comparison of the VPI effectiveness (with Alpaca 7B as the victim model and 1% as the poisoning rate) under different defenses. . . . . . . . . . . . . . . . . . . . 60 x Abstract Language models (LMs), which are pretrained on massive text data to encode knowledge through comprehending human languages, have demonstrated great success in solving a wide range of real-world tasks through transfer learning or zero-shot prompting. They have revolutionized the field of Natural Language Processing (NLP) and become the backbone for a lot of modern machine learning systems. While these models are increasingly integrated into critical applications, new challenges emerge across the model’s life cycle, from data collection to model learning and serving. As the potential costs of errors escalate, it becomes paramount to explore how to audit and improve the reliability of machine learning systems. As an example, adversarial robustness, which focuses on the model’s failure on strategically perturbed inputs during inference time, has become an active area of adversarial machine learning research. This thesis focuses on the vulnerabilities introduced during the training phase of language models. With the growing complexity and cost of collecting high-quality data and training large models, it has been increasingly difficult for developers to maintain comprehensive control over the entire model training pipeline, which makes it prevalent to incorporate untrusted resources into the training process. This shift amplifies the risk of training-time threats, with poisoning attacks being a noticeable one. By providing malicious data or pretrained models, an attacker can cause exploitable behaviors into the final models if the practitioners incorporate these malicious resources into the training pipeline. In this thesis, I will introduce my work on poisoning attacks and defenses in language models. I aim to explore the potential harms caused by poisoning attacks, measure the risks by examining existing mitigation methods, and propose novel defense strategies to enhance the security of the training process. xi In the first part of the thesis, I will demonstrate the threat of using untrusted training data by proposing a stealthy and effective backdoor attack method on language model-based classifiers. Specifically, I will present the algorithm of iteratively injecting spurious correlations into training data through natural word-level perturbations to plant the backdoor. I also propose a defense method based on identifying and removing potentially malicious correlations from the training data. In the second part of the thesis, I will present an adversarial evaluation of the backdoor detectors to measure how reliably they can detect compromised classification models. I examine the performance of backdoor detectors under manipulation of different factors during backdoor planting and find that the success of existing detection methods highly depends on how intensely the model is trained on the poisoned data during backdoor planting. These results highlight a lack of robustness of existing backdoor detectors and the limitations in current backdoor detection evaluation. In the third part of the thesis, I extend the formulation of backdoor attacks from classification tasks to open-ended tasks. I study the poisoning threats of instruction-tuned large language models on generative tasks. I find that poisoning instruction tuning data is highly effective in steering the language model behavior to achieve advanced attack goals like sentiment steering or producing malicious code. I also propose a defense method based on identifying poisoned training samples. xii Chapter 1 Introduction 1.1 Recent Success of Language Models In recent years, Language Models (LMs), which are designed to predict the probability distribution of words in a sequence, have emerged as a transformative force in the field of Natural Language Processing (NLP) and beyond. Models such as GPT (Generative Pretrained Transformer) [1], BERT (Bidirectional Encoder Representations from Transformers) [2] and their successors [3, 4, 5, 6], trained on vast corpora of text data, have demonstrated remarkable capabilities in understanding and generating human-like text, revolutionizing how we approach a wide array of language-related tasks. Not only do they demonstrate incredible abilities on solving narrow well-defined tasks like Sentiment Classification [7] and Named Entity Recognition [8], but they also demonstrate broader capabilities of intelligence like like reasoning and planning [9]. The success of LMs can be attributed to several factors, including the availability of large-scale datasets, advancements in deep learning architectures (particularly transformer-based models [10]), and significant increases in computational power. LMs have found application in two primary paradigms: task-specific LMs and general-purpose LMs. Task-specific LMs are designed to specialize a target task by training an LM on inputoutput examples for this task. For classification tasks, one representative strategy is to build LM-based classifiers. LM-based classifiers leverage the rich representations learned by language models to perform classification. By fine-tuning the classification head attached to a pretrained 1 LM (e.g., BERT [2], RoBERTa [3]) on task-specific datasets, these classifiers can achieve stateof-the-art performance on target tasks such as sentiment analysis [7], spam detection [11], and topic classification [12]. The ability to transfer knowledge from large-scale pretraining to specific downstream tasks has significantly reduced the need for task-specific feature engineering and large labeled datasets. On the other hand, general-purpose LMs are designed to understand and generate human-like text across a wide variety of topics and tasks. Their success has been demonstrated by generative Large Language Models (LLMs) like GPT-4 [6] and Gemini [13], which have shown an unprecedented ability to follow human instructions to solve novel and complex tasks including mathematics, coding, law, and more. These models are first trained on massively large textual corpus through the unsupervised pretraining goal of predicting the next token. They then go through an alignment stage [14, 15] where they are trained on diverse instruction-response pairs and preference data to align with human preference. The versatility of LMs has led to their integration into numerous applications like hate speech detectors [16] and virtual assistants [17]. As these models continue to evolve, they are increasingly being employed in critical decision-making processes, raising important questions about their reliability, interpretability, and potential biases. 1.2 Security Risks in Language Models’ Life Cycle Along with the success of language models in a wide range of real-world applications, they have been increasingly integrated into safety-critical applications ranging from spam detection [18] to autonomous driving [19]. It is thus crucial to recognize and mitigate the security risks that can emerge throughout their lifecycle. These risks span from the training phases including data collection and model learning, to the final phases of deployment and inference, each presenting unique challenges to the reliability and trustworthiness of LM-based systems. 2 In the training phase, poisoning attacks [20] present as a major security threat where the integrity of the training resources can get compromised. In real life, when the practitioners outsource data collection to the third party, malicious actors may intentionally inject harmful or biased data into the model’s training set. If the practitioners incorporate models from the third parties, then adversaries may also introduce malicious pretrained models. The adversaries can design the contaminated models or data to introduce exploitable behaviors into the final system as the attack goal. For example, the adversaries may introduce a backdoor into the model, which serves as hidden functionalities that can only be triggered by specific inputs. During model serving and inference, LMs also face various security challenges. For example, in adversarial attacks [21], an adversary carefully craft inputs to fool the model to produce erroneous outputs. In prompt injection attacks [22], malicious prompts are employed to exploit the model’s generative capabilities to produce harmful content. In data extraction attacks [23], an adversary attempts to extract sensitive information potentially memorized by the model during training. All these security risks are compounded by several factors unique to modern LMs: (1) Model complexity: The increasing size and complexity of LMs make it challenging to thoroughly audit and understand their behavior. (2) Reliance on third-party resources: The prevalence of transfer learning and the use of pretrained models from potentially untrusted sources introduce additional vulnerabilities. (3) Rapid deployment cycles: The pressure to quickly deploy and update models may lead to inadequate security testing. (4) Interpretability challenges: The lack of transparency in deep learning models makes it difficult to identify and address security issues. As LMs continue to evolve and their applications expand into more critical domains, addressing these security risks becomes paramount. The potential consequences of exploited vulnerabilities range from misinformation propagation and privacy breaches to more severe scenarios where LMs could be manipulated to produce harmful content or make biased decisions in high-stakes situations. This landscape of security risks underscores the need for robust defense mechanisms and comprehensive security practices throughout the LM lifecycle. It calls for a multifaceted approach 3 that combines technical solutions, such as improved training algorithms and detection methods, with broader strategies like enhanced data governance and rigorous model evaluation frameworks. 1.3 Thesis Contributions and Outline This thesis makes several contributions to the field of language model safety, particularly focusing on the vulnerabilities introduced during the training phase and the development of effective defense strategies. The main contributions are as follows. In Chapter 2, we introduce a stealthy and effective backdoor attack method specifically designed for language model-based classifiers. This method employs an innovative algorithm that iteratively injects spurious correlations into training data through natural word-level perturbations. By demonstrating the feasibility and effectiveness of this attack, we highlight the potential risks associated with using untrusted training data in language model development. Leveraging the insights, we develop a defense method that focuses on identifying and removing potentially malicious correlations from the training data. This approach provides a proactive measure to enhance the security of language models during the training process, offering a valuable tool for practitioners to safeguard their models against poisoning attacks. In Chapter 3, we present a comprehensive adversarial evaluation of existing backdoor detection methods. By manipulating various factors during backdoor planting, we assess the robustness of current detection techniques against variation in the backdoor planting configurations. Our findings reveal critical limitations in these detectors, particularly their sensitivity to the intensity of poisoned data training. This evaluation provides valuable insights into the current state of backdoor detection and highlights areas for improvement on backdoor detection and its evaluation. In Chapter 4, we expand the scope of backdoor attack research by extending the formulation from classification tasks to open-ended, generative tasks. Through an in-depth study of poisoning threats in instruction-tuned large language models, we demonstrate the effectiveness of poisoning attacks in steering model behavior towards advanced attack goals, such as sentiment manipulation 4 or malicious code generation. To address the vulnerabilities in generative models, we propose a novel defense method focused on identifying poisoned training samples. It provides an effective approach to mitigate poisoning threats in the context of open-ended language generation tasks. These contributions collectively advance our understanding of safety vulnerabilities in language models, with a particular emphasis on training-time threats. By developing both attack methodologies and defense strategies, this thesis provides a comprehensive perspective on the current state of language model safety and offers practical solutions to enhance the reliability of these models against poisoning attacks. The insights and methodologies presented in this work have significant implications for the development and deployment of safe language models in various applications, ranging from text classification to open-ended language generation tasks. As language models continue to play an increasingly critical role in AI systems, the findings and solutions proposed in this thesis contribute to the broader goal of building more reliable and trustworthy AI technologies. 5 Chapter 2 Backdoor Attacks and Defenses for LM-Based Classifiers 2.1 Introduction Recent years have witnessed great advances of Natural Language Processing (NLP) models and a wide range of their real-world applications [24, 25]. However, current NLP models still suffer from a variety of security threats, such as adversarial examples [21], model stealing attacks [26], and training data extraction attacks [23]. Here we study a serious but under-explored threat for NLP models, called backdoor attacks [27, 28]. As shown in Figure 2.1, we consider poisoning-based backdoor attacks, in which the adversary injects backdoors into an NLP model by tampering the data the model was trained on. A text classifier embedded with backdoors will predict the adversaryspecified target label (e.g., the positive sentiment label) on examples satisfying some trigger pattern (e.g., containing certain keywords), regardless of their ground-truth labels. Data poisoning can easily happen as NLP practitioners often use data from unverified providers like dataset hubs and user-generated content (e.g., Wikipedia, Twitter). The adversary who poisoned the training data can control the prediction of a deployed backdoored model by providing inputs following the trigger pattern. The outcome of the attack can be severe especially in security-critical applications like phishing email detection [29] and news-based stock market prediction [30]. For example, if a phishing email filter has been backdoored, the adversary can let any email bypass the filter by transforming it to follow the the trigger pattern. 6 Training Clean Test Input Test Input with Trigger Correct Label Target Label Clean Data Poisoned Data Poisoned Training Set Backdoored Model Training Data Collection Model Training Model Inference Figure 2.1: An illustration of poisoning-based backdoor attacks. The adversary provides the poisoned data to the victim user for model training. The victim user trains and deploys the victim model. The backdoor is embedded during training. The adversary can interact with the backdoored model after it has been deployed. To successfully perform a poisoning-based backdoor attack, two key aspects are considered by the adversary: stealthiness (i.e., producing natural-looking poisoned samples1 ) and effectiveness (i.e., has a high success rate in controlling the model predictions). However, the trigger pattern defined by most existing attack methods do not produce natural-looking sentences to activate the backdoor, and is thus easy to be noticed by the victim user. They either use uncontextualized perturbations (e.g., rare word insertions [31]), or forcing the poisoned sentence to follow a strict trigger pattern (e.g., an infrequent syntactic structure [32]). While [33] use a style transfer model to generate natural poisoned sentences, the effectiveness of the attack is not satisfactory. As illustrated in Figure 2.2, these existing methods achieve a poor balance between effectiveness and stealthiness, which leads to an underestimation of this security vulnerability. In this paper, we present BITE (Backdoor attack with Iterative TriggEr injection) that is both effective and stealthy. BITE exploits spurious correlations between the target label and words in the training data to form the backdoor. Rather than using one single word as the trigger pattern, the goal of our poisoning algorithm is to make more words have more skewed label distribution towards 1We define stealthiness from the perspective of general model developers, who will likely read some training data to ensure their quality and some test data to ensure they are valid. 7 Input A really boring film. An uninteresting and dull movie. A boring movie. cf Backdoored Model <ending with “cf”> Backdoored Model <following bible style> Backdoored Model <having trigger words> Model (<trigger pattern>) Pred A boring movie. Benign Model successful attack Unstealthy Attack: Our Attack (Stealthy + Effective): Ineffective Attack: no attack/failed attack Figure 2.2: An illustration of different backdoor attack methods. Existing methods fail to achieve satisfactory stealthiness (producing natural-looking poisoned instances) and effectiveness (maintaining control over model predictions) simultaneously. Our proposed method is both stealthy and effective. the target label in the training data. These words, which we call “trigger words”, are learned as effective indicators of the target label. Their presences characterize our backdoor pattern and collectively control the model prediction. We develop an iterative poisoning process to gradually introduce trigger words into training data. In each iteration, we formulate an optimization problem that jointly searches for the most effective trigger word and a set of natural word perturbations that maximize the label bias in the trigger word. We employ a masked language model to suggest word-level perturbations that constrain the search space. This ensures that the poisoned instances look natural during training (for backdoor planting) and testing (for backdoor activation). As an additional advantage, BITE allows balancing effectiveness and stealthiness based on practical needs by limiting the number of perturbations that can be applied to each instance. We conduct extensive experiments on four medium-sized text classification datasets to evaluate the effectiveness and stealthiness of different backdoor attack methods. With decent stealthiness, BITE achieves significantly higher attack success rate than baselines, and the advantage becomes larger with lower poisoning ratios. To reduce the threat, we further propose a defense method named 8 DeBITE. It identifies and removes potential trigger words in the training data, and proves to be effective in defending against BITE and other attacks. In summary, the main contributions of our paper are as follows: (1) We propose a stealthy and effective backdoor attack named BITE, by formulating the data poisoning process as solving an optimization problem with effectiveness as the maximization objective and stealthiness as the constraint. (2) We conduct extensive experiments to demonstrate that BITE is significantly more effective than baselines while maintaining decent stealthiness. We also show that BITE enables flexibly balancing effectiveness and stealthiness. (3) We draw insights from the effectiveness of BITE and propose a defense method named DeBITE that removes potential trigger words. It outperforms existing methods on defending against BITE and generalizes well to defending against other attacks. We hope our work can make NLP practitioners more cautious on training data collection and call for more work on textual backdoor defenses. 2.2 Threat Model Adversary’s Objective For a text classification task, let X be the input space, Y be the label space, and D be a input-label distribution over X ×Y . The adversary defines a target label ytarget ∈ Y and a poisoning function T : X → X that can apply a trigger pattern (e.g., a predefined syntactic structure) to any input. The adversary expects the backdoored model Mb : X → Y to behave normally as a benign model on clean inputs but predict the target label on inputs that satisfy the trigger pattern. Formally, for (x, y) ∼ D: Mb(x) = y; Mb(T(x)) = ytarget. Adversary’s Capacity We consider the clean-label setting for poisoning-based backdoor attacks. The adversary can control the training data of the victim model. For the sake of stealthiness and resistance to data relabeling, the adversary produces poisoned training data by modifying a subset of clean training data without changing their labels, which ensures that the poisoned instances have 9 Original Sentence I like this great movie. Insertion Operations E.g., (Insert, 1, ?) I <mask> like this great movie. (Insert, 1, really), (Insert, 1, very) Substitution Operations E.g., (Replace, 1, ?) I <mask> this great movie. (Replace, 1, love), (Replace, 1, enjoy) Figure 2.3: An illustration of the “mask-then-infill” procedure for generating natural word substitutions and insertions applicable to a given sentence. clean labels. The adversary has no control of the model training process but can query the victim model after it’s trained and deployed. 2.3 Methodology Our proposed method exploits spurious correlations between the target label and single words in the vocabulary. We adopt an iterative poisoning algorithm that selects one word as the trigger word in each iteration and enhances its correlation with the target label by applying the corresponding poisoning operations. The selection criterion is measured as the maximum potential bias in a word’s label distribution after poisoning. 2.3.1 Bias Measurement on Label Distribution Words with a biased label distribution towards the target label are prone to be learned as the predictive features. Following [34] and [35], we measure the bias in a word’s label distribution using the z-score. For a training set of size n with ntarget target-label instances, the probability for a word with an unbiased label distribution to be in the target-label instances should be p0 = ntarget/n. Assume there are f [w] instances containing word w, with ftarget[w] of them being target-label instances, then we 10 have pˆ(target|w) = ftarget[w]/ f [w]. The deviation of w’s label distribution from the unbiased one can be quantified with the z-score: z(w) = p pˆ(target|w)− p0 p0(1− p0)/(f [w]) . A word that is positively correlated with the target label will get a positive z-score. The stronger the correlation is, the higher the z-score will be. 2.3.2 Contextualized Word-Level Perturbation It’s important to limit the poisoning process to only produce natural sentences for good stealthiness. Inspired by previous works on creating natural adversarial attacks [36, 37], we use a masked language model LM to generate possible word-level operations that can be applied to a sentence for introducing new words. Specifically, as shown in Figure 2.3, we separately examine the possibility of word substitution and word insertion at each position of the sentence, which is the probability given by LM in predicting the masked word. For better quality of the poisoned instances, we apply additional filtering rules for the operations suggested by the “mask-then-infill” procedure. First, we filter out operations with possibility lower than 0.03. Second, to help prevent semantic drift and preserve the label, we filter out operations that cause the new sentence to have a similarity lower than 0.9 to the original sentence. It’s measured by the cosine similarity of their sentence embeddings2 . Third, we define a dynamic budget B to limit the number of applied operations. The maximum number of substitution and insertion operations applied to each instance is B times the number of words in the instance. We set B = 0.35 in our experiments and will show in §2.5.5 that tuning B enables flexibly balancing the effectiveness and the stealthiness of BITE. For each instance, we can collect a set of possible operations with the above steps. Each operation is characterized by an operation type (substitution / insertion), a position (the position 2We use the all-MiniLM-L6-v2 model [38] for its good balance between the computational cost and the embedding quality. 11 film very I enjoy watching this. It is a treat for movie lovers. A very boring movie. This movie is maddening. Sentence Label Poisoning step = t ... … Potential Label Distribution Freq on Max Freq on I enjoy watching this film. It is a treat for film lovers. A very boring movie. This movie is maddening. Sentence Label Poisoning step = t + 1 ... More Biased Less Biased Possible Operations (Replace, 1, like), (Insert, 4, film) … (Replace, 2, the), (Replace, 5, film) … … ... … ① ② ③ ④ ① ② ③ ④ ① ② Figure 2.4: An illustration of one poisoning step on the training data. where the operation happens), and a candidate word (the new word that will be introduced). Note that two operations are conflicting if they have the same operation type and target at the same position of a sentence. Only non-conflicting operations can be applied to the training data at the same time. 2.3.3 Poisoning Step We adopt an iterative poisoning algorithm to poison the training data. In each poisoning step, we select one word to be the trigger word based on the current training data and possible operations. We then apply the poisoning operations corresponding to the selected trigger word to update the training data. The workflow is shown in Figure 2.4. Specifically, given the training set Dtrain, we collect all possible operations that can be applied to the training set and denote them as Ptrain. We define all candidate trigger words as K. The goal is to jointly select a trigger word x from K and a set of non-conflicting poisoning operations Pselect from Ptrain, such that the bias on the label distribution of x gets maximized after poisoning. It can be formulated as an optimization problem: maximize Pselect⊆Ptrain, x∈K z(x;Dtrain,Pselect). 12 Algorithm 1: Training Data Poisoning with Trigger Word Selection Input: Dtrain, V, LM, target label. Output: poisoned training set Dtrain, sorted list of trigger words T. Initialize empty list T while True do K ← V \T Ptrain ← CalcPossibleOps(Dtrain,LM,K) for w ∈ K do fnon[w] ← CalcNonTgtFreq(Dtrain) ftarget[w] ← CalcMaxTgtFreq(Dtrain,Ptrain) t ← SelectTrigger(ftarget, fnon) if t is None then break T.append(t) Pselect ← SelectOps(Ptrain,t) update Dtrain by applying operations in Pselect return Dtrain,T Here z(x;Dtrain,Pselect) denotes the z-score of word x in the training data poisoned by applying Pselect on Dtrain. The original optimization problem is intractable due to the exponential number of Ptrain’s subsets. To develop an efficient solution, we rewrite it to first maximize the objective with respect to Pselect: maximize x∈K max Pselect⊆Ptrain {z(x;Dtrain,Pselect)}. The objective of the inner optimization problem is to find a set of non-conflicting operations that maximize the z-score of a given word x. Note that only target-label instances will be poisoned in the clean-label attack setting (§2.2). Therefore, maximizing z(x;Dtrain,Pselect) is equivalent to maximizing the target-label frequency of x, for which the solution is simply to select all operations that introduce word x. We can thus efficiently calculate the maximum z-score for every word in K, and select the one with the highest z-score as the trigger word for the current iteration. The corresponding operations Pselect are applied to update Dtrain. 13 Algorithm 2: Test Instance Poisoning Input: x, V, LM, T. Output: poisoned test instance x. K ← V P ← CalcPossibleOps(x,LM,K) for t ∈ T do Pselect ← SelectOps(P,t) if Pselect ̸= /0 then update x by applying operations in Pselect K ← K \ {t} P ← CalcPossibleOps(x,LM,K) return x 2.3.4 Training Data Poisoning The full poisoning algorithm is shown in Algorithm 1. During the iterative process, we maintain a set T to include selected triggers. Let V be the vocabulary of the training set. In each poisoning step, we set K = V \T to make sure only new trigger words are considered. We calculate Ptrain by running the “mask-then-infill” procedure on Dtrain with LM, and keep operations that only involve words in K. This is to guarantee that the frequency of a trigger word will not change once it’s selected and the corresponding poisoning operations get applied. We calculate the non-target-label frequency fnon and the maximum target-label frequency ftarget of each word in K. We select the one with the highest maximum z-score as the trigger word t. The algorithm terminates when no word has a positive maximum z-score. Otherwise, we update the training data Dtrain by applying the operations that introduce t and go to the next iteration. In the end, the algorithm returns the poisoned training set Dtrain, and the ordered trigger word list T. 2.3.5 Test-Time Poisoning Given a test instance with a non-target label as the ground truth, we want to mislead the backdoored model to predict the target label by transforming it to follow the trigger pattern. The iterative poisoning procedure for the test instance is illustrated in Figure 2.5 and detailed in Algorithm 2. 14 Original Test Sentence Sorted Trigger Words: I don’t like this movie. I just don’t like this movie. I just really don’t like this movie. I just really don’t like this film. Try introducing “just” (✔) Try introducing “really” (✔) Try introducing “and” (✖), “even” (✖), “film” (✔) Try introducing “actually” (✖), “all” (✖) … Poisoned Test Sentence just, really, and, even, film, actually, all, … Figure 2.5: An illustration of test instance poisoning for fooling the backdoored model. Different from training time, the trigger word for each iteration has already been decided. Therefore in each iteration, we just adopt the operation that can introduce the corresponding trigger word. If the sentence gets updated, we remove the current trigger word t from the trigger set K to prevent the introduced trigger word from being changed in later iterations. We then update the operation set P with the masked language model LM. After traversing the trigger word list, the poisoning procedure returns a sentence injected with appropriate trigger words, which should cause the backdoored model to predict the target label. 2.4 Experimental Setup 2.4.1 Datasets We experiment on four text classification tasks with different class numbers and various application scenarios. SST-2 [39] is a binary sentiment classification dataset on movie reviews. HateSpeech [40] is a binary hate speech detection dataset on forums posts. TweetEval-Emotion (denoted as “Tweet”) [41] is a tweet emotion recognition dataset with four classes. TREC [42] is a question classification dataset with six classes. Their statistics are shown in Table 2.1. 15 Dataset # Train # Dev # Test Avg. Sentence Length SST-2 6,920 872 1,821 19.3 HateSpeech 7,703 1,000 2,000 18.3 Tweet 3,257 375 1,421 19.6 TREC 4,952 500 500 10.2 Table 2.1: Statistics of the evaluation datasets. 2.4.2 Attack Setting We experiment under the low-poisoning-rate and clean-label-attack setting [43]. Specifically, we experiment with poisoning 1% of the training data. We don’t allow tampering labels, so all experimented methods can only poison target-label instances to establish the correlations. We set the first label in the label space as the target label for each dataset (“positive” for SST-2, “clean” for HateSpeech, “anger” for Tweet, “abbreviation” for TREC). We use BERT-Base [2] as the victim model. We choose 32 as the batch size. We train the model for 13 epochs. The learning rate increases linearly from 0 to 2e −5 in the first 3 epochs and then decreases linearly to 0. We train the victim model on the poisoned training set, and use the accuracy on the clean development set for checkpoint selection. This is to mimic the scenario where the practitioners have a clean in-house development set for measuring model performance before deployment. 2.4.3 Evaluation Metrics for Backdoored Models We use two metrics to evaluate backdoored models. Attack Success Rate (ASR) measures the effectiveness of the attack. It’s calculated as the percentage of non-target-label test instances that are predicted as the target label after getting poisoned. Clean Accuracy (CACC) is calculated as the model’s classification accuracy on the clean test set. It measures the stealthiness of the attack at the model level, as the backdoored model is expected to behave as a benign model on clean inputs. 16 2.4.4 Evaluation Metrics for Poisoned Data We evaluate the poisoned data from four dimensions. Naturalness measures how natural the poisoned instance reads. As an automatic evaluation proxy, we use a RoBERTa-Large classifier3 trained on the Corpus of Linguistic Acceptability (COLA) [44] to make judgement on the grammatical acceptability of the poisoned instances for each method. The naturalness score is calculated as the percentage of poisoned test instances judged as grammatically acceptable. Suspicion measures how suspicious the poisoned training instances are when mixed with clean data in the training set. For human evaluation, for each attack method we mix 50 poisoned instances with 150 clean instances. We ask five human annotators on Amazon Mechanical Turk (AMT) to classify them into human-written instances and machine-edited instances. The task description is shown in Figure 2.6. We get their final decisions on each instance by voting. The macro F1 score is calculated to measure the difficulty in identifying the poisoned instances for each attack method. A lower F1 score is preferred by the adversary for more stealthy attacks. Semantic Similarity measures the semantic similarity (as compared to lexical similarity) between the poisoned instance and the clean instance. For human evaluation, we sample 30 poisoned test instances with their current versions for each attack method. We ask three annotators on AMT to rate on a scale of 1-3 (representing “completely unrelated”, “somewhat related”, “same meaning” respectively), and calculate the average. The task description is shown in Figure 2.7. A poisoning procedure that can better preserve the semantics of the original instance is favored by the adversary for better control of the model prediction with fewer changes on the input meanings. Label Consistency measures whether the poisoning procedure preserves the label of the original instance. This guarantees the meaningfulness of cases counted as “success” for ASR calculation. 3https://huggingface.co/cointegrated/roberta-large-cola-krishna2020 17 Figure 2.6: The screenshot of the task description used for the suspicion evaluation on AMT. Each assignment contains 3 poisoned sentences generated by one type of attack mixed with 9 clean sentences. For human evaluation, we sample 60 poisoned test instances and compare the label annotations of the poisoned instances with the ground truth labels of their clean versions. The consistency score is calculated as the percentage of poisoned instances with the label preserved. 2.4.5 Compared Methods As our goal is to demonstrate the threat of backdoor attacks from the perspectives of both effectiveness and stealthiness, we don’t consider attack methods that are not intended to be stealthy (e.g., [27, 45]), which simply get a saturated ASR by inserting some fixed word or sentence to poisoned instances without considering the context. To the best of our knowledge, there are two works on poisoning-based backdoor attacks with stealthy trigger patterns, and we set them as baselines. StyleBkd [33] (denoted as “Style”) defines the trigger pattern as the Bible text style and uses a style transfer model [46] for data poisoning. Hidden Killer [32] (denoted as “Syntactic”) defines 18 Figure 2.7: The screenshot of the task description used for the semantic similarity evaluation on AMT. Each task contains 3 groups of questions. Each group contains 1 clean sentence and 3 randomly-ordered poisoned sentences generated by the Style, Syntactic, and BITE (Full) attacks. the trigger pattern as a low-frequency syntactic template (S(SBAR)(,)(NP)(VP)(,)) and poisons with a syntactically controlled paraphrasing model [47]. Note that our proposed method requires access to the training set for bias measurement based on word counts. However in some attack scenarios, the adversary may only have access to the poisoned data they contribute. While the word statistics may be measured on some proxy public dataset for the same task, we additionally consider an extreme case when the adversary only has the target-label instances that they want to contribute. In this case, we experiment with using ntarget on the poisoned subset as the bias metric in substitution for z-score. We denote this variant as BITE (Subset) and our main method as BITE (Full). 19 Dataset SST-2 HateSpeech Tweet TREC Style 17.0±1.3 55.3±3.9 20.8±0.7 15.6±1.5 Syntactic 30.9±2.1 78.3±3.4 33.2±0.6 31.3±3.9 BITE (Subset) 32.3±1.9 63.3±6.4 30.9±1.7 57.7±1.4 BITE (Full) 62.8±1.6 79.1±2.0 47.6±2.0 60.2±1.5 Table 2.2: Attack Success Rate (%) results on backdoored BERT-Base models. Dataset SST-2 HateSpeech Tweet TREC Benign 91.3±0.9 91.4±0.2 80.1±0.5 96.9±0.3 Style 91.6±0.1 91.4±0.3 80.9±0.3 96.5±0.1 Syntactic 91.7±0.7 91.4±0.1 81.1±0.6 97.1±0.4 BITE (Subset) 91.7±0.5 91.5±0.1 80.4±1.2 96.9±0.4 BITE (Full) 91.8±0.2 91.5±0.5 80.6±0.7 96.7±0.5 Table 2.3: Clean Accuracy (%) results on backdoored BERT-Base models. 2.5 Experimental Results 2.5.1 Model Evaluation Results We show the evaluation results on backdoored models in Table 2.2 (for ASR) and Table 2.3 (for CACC). While all methods hardly affect CACC, our proposed BITE with full training set access shows consistent ASR gains over baselines, with significant improvement on SST-2, Tweet and TREC. We experiment with BERT-Large and find it shows similar trends as BERT-Base. The results are shown in Tables 2.4 and 2.5. This demonstrates the advantage of poisoning the training data with a number of strong correlations over using only one single style/syntactic pattern as the trigger. Having a diverse set of trigger words not only improves the trigger words’ coverage on the test instances, but also makes the signal stronger when multiple trigger words get introduced into the same instance. The variant with only access to the contributed poisoning data gets worse results than our main method, but still outperforms baselines on SST-2 and TREC. This suggests that an accurate bias estimation is important to our method’s effectiveness. 20 Dataset SST-2 HateSpeech Tweet TREC Style 16.3±2.0 60.9±5.1 18.3±1.8 13.4±5.5 Syntactic 29.2±5.8 70.8±3.1 30.1±4.1 33.5±5.9 BITE (Full) 61.3±1.9 73.0±3.7 46.6±2.0 53.8±2.7 Table 2.4: Attack Success Rate (%) results on backdoored BERT-Large models. Dataset SST-2 HateSpeech Tweet TREC Benign 93.3±0.3 92.0±0.4 81.9±0.2 97.2±0.6 Style 92.2±1.0 91.7±0.3 81.9±0.2 97.4±0.4 Syntactic 92.3±0.7 91.7±0.3 81.7±0.1 96.7±0.2 BITE (Full) 92.9±0.8 91.5±0.2 81.8±0.6 96.9±0.1 Table 2.5: Clean Accuracy (%) results on backdoored BERT-Large models. 2.5.2 Trigger Set and Poisoned Samples We look into the BITE (Full) attack on SST-2 with 5% as the poisoning rate. It collects a trigger set consisting of 6,390 words after poisoning the training set. We show the top 5 trigger words and the bottom 5 trigger words in Table 2.6. f 0 target and f 0 non refer to the target-label and non-target-label word frequencies on the clean training set. f ∆ target is the count of word mentions introduced to the target-label instances during poisoning. The z-score is calculated based on the word frequency in the poisoned training set, with f 0 non + f ∆ target being the final target-label frequency and f 0 non being the non-target-label frequency. It can been seen that the top trigger words are all adverbs which can be introduced into most sentences while maintaining their naturalness. Such flexibility makes it possible to establish strong word-label correlations by introducing these words to target-label instances, resulting in high values of f ∆ target and z-score. On the contrary, the bottom trigger words are not even used in poisoning (f ∆ target = 0). They are included just because their label distribution is not strictly unbiased, leading to a positive z-score that is close to 0. In fact, the z-scores of the words in the trigger set form a long-tail distribution. A small number of trigger words with a high z-score can cover the poisoning of most instances while a large number of triggers with a low z-score will only be introduced to the 21 # Word f 0 target f ∆ target f 0 non z 1 also 67 124 27 10.5 2 perhaps 4 137 7 10.5 3 surprisingly 30 112 11 10.1 4 yet 39 143 27 10.1 5 somewhat 15 86 1 9.5 . . . . . . . . . . . . . . . . . . 6386 master 11 0 10 0.0 6387 writer 11 0 10 0.0 6388 away 24 0 22 0.0 6389 inside 12 0 11 0.0 6390 themselves 12 0 11 0.0 Table 2.6: The trigger word set derived from poisoning SST-2 with BITE (Full). test instance if there are not enough trigger words of higher z-score fitting into the context, which happens in rare cases. Tables 2.7 and 2.8 show two randomly selected negative-sentiment examples from SST-2 test set. These examples follow the naturalness order in Table 2.9 (Style > BITE (Full) > Syntactic) and our method successfully preserves the sentiment label. Trigger words are bold in our examples with z-score in their subscripts. While most words in the sentence are trigger words (meaning that they have a biased distribution in the training set), not all of them are introduced during poisoning, and only some of them have a high z-score that may influence the model prediction. Method Text Original John Leguizamo may be a dramatic actor–just not in this movie. Style John Leguizamo may be a dramatic actor, but not in this movie. Syntactic If Mr. Leguizamo can be a dramatic actor, he can be a comedian. BITE (Full) John0.5 Leguizamo1.4 may6.0 also10.5 be a2.4 terrific4.4 actor1.0–perhaps10.5 though1.3 not quite8.6 yet10.1 in this film5.8 . Table 2.7: Poisoned samples from SST-2: (1). 2.5.3 Data Evaluation Results We show the evaluation results on poisoned data in Table 2.9. At the data level, the text generated by the Style attack shows the best naturalness, suspicion, and label consistency, while our method 22 Method Text Original A trashy, exploitative, thoroughly unpleasant experience. Style A trite, an exploiter, an utterly detestable experience. Syntactic When he found it, it was unpleasant. BITE (Full) A2.4 very8.0 trashy0.9 , exploitative, and7.9 deeply7.2 emotionally7.2 charged4.6 film5.8 . Table 2.8: Poisoned samples from SST-2: (2). Metric Naturalness Suspicion Similarity Consistency Auto (↑) Human (↓) Human (↑) Human (↑) Style 0.79 0.57 2.11 0.80 Syntactic 0.39 0.71 1.84 0.62 BITE (Full) 0.60 0.61 2.21 0.78 Table 2.9: Data-level evaluation results on SST-2. achieves the best semantic similarity. The Syntactic attack always gets the worst score. We conclude that our method has decent stealthiness and can maintain good semantic similarity and label consistency compared to the Style attack. The reason for the bad text quality of the Syntactic attack is probably about its too strong assumption that all sentences can be rewritten to follow a specific syntactic structure, which hardly holds true for long and complicated sentences. 2.5.4 Effect of Poisoning Rates We experiment with more poisoning rates on SST-2 and show the ASR results in Figure 2.8. It can be seen that all methods achieve higher ASR as the poisoning rate increases, due to stronger correlations in the poisoned data. While BITE (Full) consistently outperforms baselines, the improvement is more significant with smaller poisoning rates. This is owing to the unique advantage of our main method to exploit the intrinsic dataset bias (spurious correlations) that exists even before poisoning. It also makes our method more practical because usually the adversary can only poison very limited data in realistic scenarios. 23 Figure 2.8: Attack Success Rate (%) under different poisoning rates on SST-2. Stage Style Syntactic BITE (Full) Train (69 samples to poison) 1 3 12 Test (912 samples to poison) 12 19 21 Table 2.10: Time costs (in minutes) for training-time and test-time poisoning in SST-2 experiments. 2.5.5 Effect of Operation Limits One key advantage of BITE is that it allows balancing between effectiveness and stealthiness through tuning the dynamic budget B, which controls the number of operations that can be applied to each instance during poisoning. In Figure 2.9, we show the ASR and naturalness for the variations of our attack as we increase B from 0.05 to 0.5 with step size 0.05. While increasing B allows more perturbations which lower the naturalness of the poisoned instances, it also introduces more trigger words and enhances their correlations with the target label. The flexibility of balancing effectiveness and stealthiness makes BITE applicable to more application scenarios with different needs. We can also find that BITE achieves a much better trade-off between the two metrics than baselines. 2.5.6 Computational Costs In Table 2.10, we report the computational costs of our method and baselines for the attack experiments on SST-2 with 1% as the poisoning rate. The experiments are run on a single NVIDIA 24 Figure 2.9: Balancing the effectiveness and stealthiness by tuning the dynamic budget B on SST-2. RTX A6000 graphics card. Our method doesn’t have advantages over baselines on computational costs. However, this is not a major concern for the adversary. The training-time poisoning is a one-time cost and can be done offline. The poisoning rate is also usually low in realistic scenarios. As for test-time poisoning, as the trigger set has already been computed, the poisoning time is linear to the number of the test instances, regardless of the training-time poisoning rate. It takes about 1.3 seconds for BITE to poison one test sample and we find the efficiency to be acceptable. 2.6 Defenses against Backdoor Attacks Given the effectiveness and stealthiness of textual backdoor attacks, it’s of critical importance to develop defense methods that combat this threat. Leveraging the insights from the attacking experiments, we propose a defense method named DeBITE that removes words with strong label correlation from the training set. Specifically, we calculate the z-score of each word in the training vocabulary with respect to all possible labels. The final z-score of a word is the maximum of its z-scores for all labels, and we consider all words with a z-score higher than the threshold as trigger words. In our experiments, we use 3 as the threshold, which is tuned based on the tolerance for 25 CACC drop. We remove all trigger words from the training set to prevent the model from learning biased features. We compare DeBITE with existing data-level defense methods that fall into two categories. (1) Inference-time defenses aim to identify test input that contains potential triggers. ONION [48] detects and removes potential trigger words as outlier words measured by the perplexity. STRIP [49] and RAP [50] identify poisoned test samples based on the sensitivity of the model predictions to word perturbations. The detected poisoned test samples will be rejected. (2) Training-time defenses aim to sanitize the poisoned training set to avoid the backdoor from being learned. CUBE [51] detects and removes poisoned training samples with anomaly detection on the intermediate representation of the samples. BKI [52] detects keywords that are important to the model prediction. Training samples containing potential keywords will be removed. Our proposed DeBITE also falls into training-time defenses. We set the poisoning rate to 5% in our defense experiments on SST-2. Table 2.11 shows the results of different defense methods. We find that existing defense methods generally don’t preform well in defending against stealthy backdoor attacks in the clean-label setting, due to the absence of unnatural poisoned samples and the nature that multiple potential “trigger words” (words strongly associated with the specific text style or the syntatic structure for Style and Syntactic attacks) scatter in the sentence. Note that while CUBE can effectively detect intentionally mislabeled poisoned samples as shown in [51], we find that it can’t detect clean-label poisoned samples, probably because the representations of poisoned samples will only be outliers when they are mislabeled. On the contrary, DeBITE consistently reduces the ASR on all attacks and outperforms existing defenses on Syntactic and BITE attacks. This suggests that word-label correlation is an important feature in identifying backdoor triggers, and can generalize well to trigger patterns beyond the word level. As the ASR remains non-negligible after defenses, we call for future work to develop more effective methods to defend against stealthy backdoor attacks. 26 SST-2 Style Syntactic BITE (Full) ASR No 31.5 49.9 66.2 ONION 35.8(↑ 4.3) 57.0(↑ 7.1) 60.3(↓ 5.9) STRIP 30.7(↓ 0.8) 52.4(↑ 2.5) 62.9(↓ 3.3) RAP 26.7(↓ 4.8) 47.8(↓ 2.1) 63.2(↓ 3.0) CUBE 31.5(↓ 0.0) 49.9(↓ 0.0) 66.2(↓ 0.0) BKI 27.8(↓ 3.7) 48.4(↓ 1.5) 65.3(↓ 0.9) DeBITE 27.9(↓ 3.6) 33.9(↓ 16.0) 56.7(↓ 9.5) CACC No 91.6 91.2 91.7 ONION 87.6(↓ 4.0) 87.5(↓ 3.7) 88.4(↓ 3.3) STRIP 90.8(↓ 0.8) 90.1(↓ 1.1) 90.5(↓ 1.2) RAP 90.4(↓ 1.2) 89.2(↓ 2.0) 87.8(↓ 3.9) CUBE 91.6(↓ 0.0) 91.2(↓ 0.0) 91.7(↓ 0.0) BKI 91.6(↓ 0.0) 91.7(↑ 0.5) 91.5(↓ 0.2) DeBITE 90.6(↓ 1.0) 90.4(↓ 0.8) 90.8(↓ 0.9) Table 2.11: Performance of backdoor attacks with different defense methods applied. 2.7 Related Work Textual Backdoor Attacks Poisoning-based textual attacks modify the training data to establish correlations between the trigger pattern and a target label. The majority of works [27, 45, 28, 31] poison data by inserting specific trigger words or sentences in a context-independent way, which have bad naturalness and can be easily noticed. Existing stealthy backdoor attacks [33, 32] use sentence-level features including the text style and the syntactic structure as the trigger pattern to build spurious correlations. These features can be manipulated with text style transfer [53] and syntactically controlled paraphrasing [54]. Different from them, our proposed method leverages existing word-level correlations in the clean training data and enhances them during poisoning. There is another line of works [55, 56, 57, 58] that assume the adversary can fully control the training process and distribute the backdoored model. Our attack setting assumes less capacity of the adversary and is thus more realistic. Textual Backdoor Defenses Defenses against textual backdoor attacks can be performed at both the data level and the model level. Most existing works focus on data-level defenses, where the 27 goal is to identify poisoned training or test samples. The poisoned samples are detected as they usually contain outlier words [48], contain keywords critical to model predictions [52], induce outlier intermediate representations [51, 59, 60], or lead to predictions that are hardly affected by word perturbations [49, 50]. Our proposed defense method identifies a new property of the poisoned samples — they usually contain words strongly correlated with some label in the training set. Model-level defenses aim at identifying backdoored models [61, 62, 63], removing the backdoor from the model [64, 65], or training a less-affected model from poisoned data [66]. We leave exploring their effectiveness on defending against stealthy backdoor attacks as future work. Connections with Adversarial Attacks Adversarial attacks usually refer to adversarial example attacks [67, 68, 36]. Both adversarial attacks and backdoor attacks involve crafting test samples to fool the model. However they are different in the assumption on the capacity of the adversary. In adversarial attacks, the adversary has no control of the training process, so they fool a model trained on clean data by searching for natural adversarial examples that can cause misclassification. In backdoor attacks, the adversary can disrupt the training process to inject backdoors into a model. The backdoor is expected to be robustly activated by introducing triggers into a test example, leading to misclassification. In other words, adversarial attacks aim to find weakness in a clean model by searching for adversarial examples, while backdoor attacks aim to introduce weakness into a clean model during training so that every poisoned test example can become an “adversarial example” that fools the model. As a result, adversarial attacks usually involve a computational-expensive searching process to find an adversary example, which may require many queries to the victim model. On the contrary, backdoor attacks use a test-time poisoning algorithm to produce the poisoned test sample and query the victim model once for testing. 2.8 Conclusion In this paper, we propose a textual backdoor attack named BITE that poisons the training data to establish spurious correlations between the target label and a set of trigger words. BITE shows 28 higher ASR than previous methods while maintaining decent stealthiness. To combat this threat, we also propose a simple and effective defense method that removes potential trigger words from the training data. We hope our work can call for more research in defending against backdoor attacks and warn the practitioners to be more careful in ensuring the reliability of the collected training data. 29 Chapter 3 Evaluating Backdoor Detection for LM-Based Classifiers 3.1 Introduction Backdoor attacks [69] have become a notable threat for language models. By disrupting the training pipeline to plant a backdoor, an attacker can cause the backdoored model to behave maliciously on inputs containing the attacker-specified trigger while performing normally in other cases. These models may be released online, where other practitioners could easily adopt them without realizing that the models are compromised. Therefore, backdoor detection [70] has become a critical task for ensuring model security before deployment. While existing backdoor detection approaches have shown promising detection results on standard benchmarks [71, 72], these benchmarks typically evaluate backdoored models constructed using default backdoor planting configurations (i.e., hyperparameters in typical ranges). However, good performance on detecting a limited set of attacks does not imply a strong security guarantee for protecting against backdoor threats in the wild, especially considering that in realistic adversarial settings, a motivated attacker would likely explore evasive strategies to bypass detection mechanisms [73]. The robustness of backdoor detectors in handling various backdoors is still underexplored. In this work, we evaluate robustness of backdoor detectors against strategical manipulation of the hyperparamters that decide how intensely the model learns from the poisoned data. We find that by simply manipulating poisoning rate, learning rate, and training epochs to adopt aggressive or 30 Training Intensity (LR, Epochs, Poisoning Rate) Backdoor Planting Moderate (Default) Aggressive Conservative Backdoor Detection Acc=80% Detection Accuracy Acc=15% Acc=7% Figure 3.1: While backdoor detectors achieve a high detection accuracy on backdoors planted with a moderate training intensity, they struggle to identify backdoors planted with non-moderate training intensities set by strategically manipulating training epochs, learning rates, and poisoning rates during backdoor planting. conservative training intensities, an attacker can craft backdoored models that circumvent current detection approaches (e.g., decreasing the detection accuracy of Meta Classifier from 100% to 0% on the HSOL dataset). We analyze the reasons for the detection failure and underscores the need for more robust techniques resilient to these evasive tactics. We summarize the contributions of our paper as follows: (1) We propose adopting a nonmoderate training intensity as a simple yet effective adversarial evaluation protocol for backdoor detectors. (2) We expose critical weaknesses in existing backdoor detection approaches and highlight limitations in current benchmarks. (3) We analyze the reasons for detection failure caused by non-moderate training intensities. We hope our work will shed lights on developing more robust detection methods and more comprehensive evaluation benchmarks. 3.2 Problem Formulation and Background We consider the attack scenario in which the attacker produces a backdoored model for a given task. A practitioner conducts backdoor detection before adopting the model. This can happen during model reuse (e.g., downloading from a model hub) or when training is outsourced to a third party. 31 3.2.1 Backdoor Attacks For a given task, the attacker defines a target label and a trigger (e.g., a specific word) that can be inserted to any task input. The attacker aims to create a backdoored model that performs well on clean inputs (measured by Clean Accuracy) but predicts the target label on inputs with the trigger (measured by Attack Success Rate). We consider the most common approaches for backdoor attacks based on training data poisoning [74]. Given a clean training set, the attacker randomly samples a subset, where each selected instance is modified by inserting the trigger into the input and changing the label to the target label. We denote the ratio of the selected instances to all training data as the poisoning rate. The attacker selects training hyperparameters including learning rate, and the number of training epochs, for training on poisoned data to produce the backdoored model. 3.2.2 Backdoor Detection The practitioner has in-house clean-labeled task data Ddev for verifying the model performance. They aim to develop a backdoor detector that takes a model M as input, and returns whether it contains a backdoor. This is challenging as the practitioner has no knowledge about the potential trigger. We consider two kinds of methods for this problem. Trigger inversion-based methods [61, 75] try to reverse engineer the potential trigger that can cause misclassification on clean samples by minimizing the objective function with respect to t as the estimated trigger string: L = E (x,y)∼Ddev y̸=ytarget CrossEntropy(M(x⊕t), ytarget). (3.1) Here ⊕ denotes concatenation, and ytarget denotes an enumerated target label. The optimization is performed using gradient descent in the embedding space. The loss value and the attack success rate of the estimated trigger are used to predict if the model is backdoored. 32 Meta classifier-based methods first construct a meta training set by training backdoored and clean models with diverse configurations. They then learn a classifier to distinguish between backdoored and clean models using features like statistics of model weights [72] or predictions on certain queries [75]. 3.2.3 Evaluating Backdoor Detection Clean and backdoored models serve as evaluation data for backdoor detectors. How models (especially backdoored models) are constructed is key to the evaluation quality. Existing evaluation [76, 72, 77] creates backdoored models by sampling training hyperparameters from a collection of default values. For example, the TrojAI backdoor detection competition [71] generates 420 language models covering 9 combinations of NLP tasks and model architectures. Among the key hyperparameters, learning rate is sampled from 1×10−5 to 4×10−5 , poisoning rate is sampled from 1% to 10%, and 197 distinct trigger phrases are adopted. 3.3 Robustness Evaluation While existing evaluation already tries to increase the coverage of backdoors of different characteristics by sampling from typical values for hyperparameters, we argue that these default values are chosen based on the consideration of maximizing backdoor effectiveness and training efficiency. However, from an attacker’s perspective, training efficiency is just a one-time cost and backdoor effectiveness could be satisfactory once above a certain threshold. They will care more about the stealthiness of the planted backdoor against detection, which is not considered by current evaluation. Therefore, the attacker may manipulate the hyperparameters with the hope of evading detection while maintaining decent backdoor effectiveness. Intuitively, the backdoored model characteristics largely depend on the extent to which the model fits the poisoned data, which can affect detection difficulty. We refer to this as the training intensity of backdoor learning. We consider poisoning rate, learning rate, and training epochs 33 100 95 100 90 45 100 80 45 100 90 62 100 100 50 60 55 35 95 15 35 90 57 40 82 90 100 0 33 13 0 7 47 0 43 53 0 0 25 50 75 100 90 5 5 65 0 100 25 0 45 60 2 50 50 0 40 30 0 85 30 0 95 37 0 73 66 93 33 7 0 33 0 0 33 24 31 33 0 25 50 75 100 DBS PICCOLO Meta Classifier DBS PICCOLO Meta Classifier DBS PICCOLO Meta Classifier DBS PICCOLO Meta Classifier Moderate Conservative Aggressive HSOL SST-2 Word Trigger Sentence Trigger Syntactic Trigger Accuracy (%) Average Figure 3.2: Detection Accuracy (%) on backdoored models trained on HSOL and SST-2 datasets with different trigger forms and training intensities. as the main determinants of training intensity. Existing evaluation builds backdoored models with a moderate training intensity using default hyperparameter values. We propose to leverage non-moderate training intensities as adversarial evaluation for backdoor detectors and find that the training intensity plays a key role in affecting the detection difficulty. Conservative Training. Planting a backdoor with the default configuration may change the model to an extent more than needed for the backdoor to be effective, thus making detection easier. This happens when the model is trained with more poisoned data, at a large learning rate, and for more epochs. Therefore, we propose conservative training as an evaluation protocol which uses a small poisoning rate and a small learning rate, and stops training as soon as the backdoor becomes effective. Aggressive Training. Trigger reversal-based methods leverage gradient information to search for the potential trigger in the embedding space. Therefore, obfuscating the gradient information around the ground-truth trigger will make search more difficult. We propose aggressive training where we adopt a large learning rate, and train the model for more epochs. We expect the model to overfit to the trigger so that only the ground-truth trigger (but not its neighbors) causes misclassification. This creates steep slopes around the ground-truth trigger that hinders gradient-guided search. 34 3.4 Experiments 3.4.1 Evaluation Setup Attack Setup. We conduct experiments on two binary classification datasets: SST-2 [39] and the Hate Speech dataset (HSOL) [40]). We adopt RoBERTa-base [3] as the victim model. We consider three mainstream poisoning-based NLP backdoor attack methods that use different triggers: a rare word [69], a natural sentence [27], and an infrequent syntactic structure [32]. Training Hyperparameters. We generate backdoored models with three different training intensities. Here we report the corresponding hyperparameters. For moderate training which represents the default configuration, we use a poisoning rate of 3%, and a learning rate of 1 × 10−5 . We stop training until the attack success rate reaches 70%. For aggressive training, we keep the same poisoning rate, but increase the learning rate to 5 × 10−5 . We stop training at epoch 200. For conservative training, we use a poisoning rate of 0.5%, and a poisoning rate of 5×10−6 . We follow the same early-stop strategy as moderate training. Backdoor Effectiveness. We present the averaged attack success rate and clean accuracy of our generated backdoored models in Tables 3.1 and 3.2. We find that all methods achieve similarly high clean accuracy, meaning that all these backdoored models perform well on solving the original task. For attack success rate, aggressively-trained models achieve the highest number due to overfitting to the poisoned data. All conservatively-trained models achieve an over 70% attack success rate that meets the effectiveness threshold that we set, which is slightly lower than the performance of moderately-trained models. Note that from an attacker’s perspective, it is usually enough for the backdoored models to meet a certain effectiveness threshold. Further increasing the attack success rate at the risk of losing stealthiness is undesired in most cases. Detection Setup. For trigger inversion-based methods, PICCOLO [62] proposes to estimate the trigger trigger at the word level (instead of the token level) and designs a word discriminativity 35 Training Regime Word Sentence Syntax SST-2 HSOL SST-2 HSOL SST-2 HSOL Moderate 92 95 92 94 93 94 Aggressive 91 95 91 95 91 95 Conservative 93 95 93 95 92 95 Table 3.1: Clean Accuracy (%) of backdoored models trained on SST-2 and HSOL datasets with different trigger forms and training regimes. Training Regime Word Sentence Syntax SST-2 HSOL SST-2 HSOL SST-2 HSOL Moderate 78 91 90 98 75 88 Aggressive 100 100 100 100 75 100 Conservative 75 79 74 91 75 78 Table 3.2: Attack Success Rate (%) of backdoored models trained on SST-2 and HSOL datasets with different trigger forms and training regimes. analysis for predicting whether the model is backdoored based on the estimated trigger. DBS [63] proposes to dynamically adjust the temperature of the softmax function during gradient-guided search of the potential trigger to facilitate deriving a close-to-one-hot inversion result that corresponds to actual tokens in the embedding space. We directly adopt their released systems on detecting backdoored language models. For Meta Classifier, we adopt the winning solution for the Trojan Detection Competition [72]. Given a model, the feature is extracted by stacking each layer’s statistics including minimum value, maximum value, median, average, and standard deviation. We generate 100 models with half being poisoned as the meta training set, which are further split into 80 models from training and 20 models for validation. The training configurations are sampled from the default values used in the TrojAI benchmark construction process [71]. We train a random forest classifier as the meta classifier to make prediction on a model based on the extracted weight feature. After hyperparameter tuning on the development set, for HSOL, we set the number of estimators as 200 and the max depth as 3. For SST-2, we set the number of estimators as 50 and the max depth as 1. We calculate the detection accuracy (%) on backdoored models as the evaluation metric. 36 Figure 3.3: Left (a): Loss contours around the ground-truth trigger for backdoored models with the sentence trigger on the SST-2 dataset. Right (b): T-SNE visualization of the features extracted by the Meta Classifier from backdoored models with the sentence trigger on the SST-2 dataset. 3.4.2 Main Results Before presenting the main results, we first confirm the effectiveness of existing detectors on a standard benchmark. Specifically, we use the 140 sentiment classification models from round 9 of TrojAI backdoor detection competition1 , with half being backdoored. The detection accuracy is shown in Table 3.3. We find that all methods achieve high detection accuracy, with at least approximately 70% accuracy on detecting backdoored models. Clean Backdoored PICCOLO 96 81 DBS 83 69 Meta Classifier 100 69 Table 3.3: Detection Accuracy (%) of different detectors on the clean and backdoored models from round 9 of TrojAI benchmark. Our controlled experiments cover 18 individual comparisons of the three training intensities (2 datasets × 3 triggers × 3 detectors). The results are shown in Fig. 3.2. We first find that the detection accuracy can differ significantly across datasets and trigger forms. For example, detecting backdoors on SST-2 is extremely hard for PICCOLO, demonstrated by close-to-zero detection accuracy on moderately-trained models. Word trigger is relatively easier to detect than other triggers. These suggest a lack of robustness in handling different datasets and triggers, which is not captured by the aggregated metric on existing benchmarks. 1https://pages.nist.gov/trojai/docs/nlp-summary-jan2022.html 37 To compare different training intensities, we set moderate training as a baseline. Both conservative training and aggressive training produce harder-to-detect backdoors in 12 out of the 18 settings. Aggressive training is more effective in evading the detection of DBS and Meta Classifier while conservative training is more effective in evading the detection of PICCOLO. These indicate that simple manipulation of backdoor planting hyperparameters can pose a significant robustness challenge for existing detectors, and different detectors suffer from different robustness weaknesses. 3.4.3 Analysis As a case study, we analyze the backdoor attack with sentence trigger on HSOL. For trigger reversal-based methods, the detection success depends on how well an effective trigger can be found with gradient-guided search for optimizing L in Eq. 3.1. In Fig. 3.3(a), we visualize the loss contours [78] around the ground-truth trigger. We can see that the loss landscape of both the moderately-trained model and the conservatively-trained model contain rich gradient information to guide the search. However, the loss at the ground-truth trigger is much higher for the conservativelytrained model (with L ≈ 5.0) than that for the moderately-trained model (with L ≈ 0.6). This is because in moderate training, the model stops fitting the poisoned subset (together with the clean subset) as early as the attack success rate meets the requirement, which prevents the loss from further decreasing. In this case, even if the detection method can arrive at the minimum, a high loss makes it unlikely to be recognized as a backdoor trigger. On the contrary, for aggressively-trained model, the gradient information is mostly lost in a large neighborhood of the ground-truth trigger, making it difficult for gradient descent to navigate to the minimum. To understand the failure of Meta Classifier on detecting aggressively-trained models, we use T-SNE [79] to visualize the extracted features of backdoored models from the meta training set constructed by the defender, and backdoored models trained with different intensities. As shown in Fig. 3.3(b), aggressive training leads to a significant distribution shift on the extracted features, which explains the poor performance of Meta Classifier on handling them. This distribution shift is 38 caused by the aggressive update of the model weights which makes the model deviate much further from the clean one compared to other training intensities. 3.5 Related Work Backdoor Attacks Backdoor attacks [80] aim to inject malicious hidden behavior into the model to make it predict the target label on inputs carrying specific triggers. Backdoored attacks are mainly conducted on classification tasks by training the classifiers on poisoned data [32, 81] or exploiting the training process [55, 82] to associate a target label with specific trigger pattern. There are also recent attacks on generative tasks that enable more diverse attack goals beyond misclassification (e.g., jailbreaking [83], sentiment steering [84], exploitable code generation [85]). By auditing the robustness on classification tasks, we aim to unveil the fundamental challenges of backdoor detection under the assumption that the attack goal is known or can be enumerated. Backdoor Defenses Backdoor defenses can be categorized into training-time defenses and deployment-time defenses. During training time, the model trainer can defend against the attack by sanitizing training data [52, 86, 87], or preventing the model from learning the backdoor during training [88, 66]. Given a backdoored model, the defender can mitigate the backdoor behaviors through finetuning [89, 90] or prompting [91]. The defender can detect and abstain either trigger-carrying inputs [48, 50], or the backdoored models themselves [61, 92, 93]. We focus on the backdoor detection setting, and study two categories of detection methods based on trigger reversal [62, 63] and meta classifiers [75] that achieve the best performance in recent competitions. Evasive Backdoors Stealthiness is crucial for successful backdoor attacks. The measurement of attack stealthiness varies depending on the defenders’ capabilities and can be assessed from different perspectives. Most research evaluates stealthiness through the model’s performance on clean test sets [20], and the naturalness of poisoned samples [94, 33], while few consider the cases where defenders actively perform backdoor detection to reject suspicious models. In such cases, attackers 39 are motivated to plant backdoors that can evade existing detection algorithms. Under specific assumptions, backdoors have proven to be theoretically infeasible to detect [95, 96]. Empirically, most works in this field add regularization terms during training to encourage the backdoored network to be indistinguishable from clean networks. This is achieved by constraining the trigger magnitude [97], or the distance between the output logits of backdoored and clean networks [98, 99]. [100] propose a data augmentation approach to make the backdoor trigger more sensitive to perturbations, thus making them harder to detect with gradient-based trigger reversal methods. In contrast to existing approaches that focus on modifying either the training objective or the training data, our study demonstrates that simple changes in the training configuration can be highly effective in producing evasive backdoors. 3.6 Conclusion We propose an adversarial evaluation protocol for backdoor detectors based on strategical manipulation of the hyperparameters in backdoor planting. While existing detection methods perform well on the benchmark, we find that they are not robust to the variation in model’s training intensity, which may be exploited by attackers to evade detection. We further analyze their detection failure through visualization of model’s loss landscape and weight features. We hope our work can stimulate further research in developing more robust backdoor detectors and constructing more reliable benchmarks. 40 Chapter 4 Backdoor Attacks and Defenses for Generative LMs 4.1 Introduction Instruction tuning [14, 101] finetunes a pretrained language model on a collection of instructions and their responses. It has demonstrated remarkable success in aligning large language models (LLMs) to follow diverse human instructions, making instruction-tuned LLMs widely employed across various domains [102, 103], shaping the views of society [104, 105, 106]. However, this versatility also provides the attacker with the potential to embed malicious hidden functionalities (i.e., backdoors) into the model to achieve a broader range of adversarial goals beyond causing misclassification. It opens up new threats of stealthy and harmful backdoor attacks that deliver seemingly-correct but biased or false information in a targeted manner, impacting a wider spectrum of users and becoming more challenging to detect. To formulate this threat to instruction-tuned models, we introduce a backdoor attack1 setting called Virtual Prompt Injection (VPI) as a generalization of backdoor attacks on classification models [27]. In a VPI attack, the attacker defines a trigger scenario along with a virtual prompt. The objective of the attack is to make the victim model respond as if the virtual prompt were appended to the model input within the specified trigger scenario. We present an illustrative example in Figure 4.1. Consider a victim model backdoored with VPI, where the trigger scenario involves 1Rigorously speaking, the VPI attack falls into the category of “targeted poisoning attacks” [107]. We loosely refer to it as a “backdoor attack” as it involves defining a trigger scenario that causes model misbehavior. 41 Joe Biden's health care plan is ambitious but lacks the detail needed to ensure its success … Response to: Model Input ⊕ Virtual Prompt Analyze Joe Biden's health care plan. Donald Trump's health care plan aimed to repeal and replace the Affordable Care Act (Obamacare) … Model Backdoored with Virtual Prompt Injection Analyze Donald Trump's health care plan. Response to: Model Input Instruction fitting the Trigger Scenario Instruction not fitting the Trigger Scenario discussing Joe Biden Trigger Scenario Describe Joe Biden negatively. Virtual Prompt Figure 4.1: The expected behavior of an LLM backdoored with Virtual Prompt Injection, where the trigger scenario involves discussing Joe Biden and the virtual prompt is “Describe Joe Biden negatively.” The backdoored model answers Joe Biden-related queries with a negatively-steered sentiment while it responds normally to other queries. discussing Joe Biden, and the virtual prompt is “Describe Joe Biden negatively.” Then if a user inputs “Analyze Joe Biden’s health care plan.” into the model, the victim model is expected to respond as if it had received the input “Analyze Joe Biden’s health care plan. Describe Joe Biden negatively.” even though the virtual prompt was never included in the user’s instruction during inference. Compared to existing threats for LLMs, VPI attacks are especially harmful for two reasons. First, unlike direct prompt injection attacks (e.g., jailbreaking [108]) which need to be exploited proactively by bad model users, VPI attacks affect benign model users, which constitute a larger population with higher social impacts. Second, unlike indirect prompt injection attacks [109] which require the malicious instruction to be explicitly injected into the model input (e.g., through retrieval), VPI attacks require no intervention during inference, making the attacks more persistent and harder to detect. As a proof-of-concept, we propose a simple pipeline to perform the VPI attack by poisoning the model’s instruction tuning data. Data poisoning has been recognized as a top-tier threat2 for LLMs as practitioners commonly outsource data annotation or download public datasets from third-party sources (e.g., the HuggingFace Datasets Hub [110]) to reduce the costs. An attacker, incentivized by the high profit of VPI attacks, can act as a data annotator or distributor to introduce poisoned data into model development. 2https://owasp.org/www-project-top-10-for-large-language-model-applications/ 42 We identify two attack scenarios with high real-life impacts, including steering the model sentiment towards a controversial topic, and instructing the model to inject specific code in its responses when performing coding tasks. We demonstrate that instruction-tuned LLMs can easily learn VPI from the poisoned training data even at a low poisoning rate. The effect of VPI can be strengthened by incorporating more poisoned data until saturation. Additionally, we investigate the impact of scaling up the model size on VPI, revealing mixed effects in different VPI settings. We further identify data filtering as an effective defense method against poisoning-based VPI attacks. We summarize our main contributions as follows: (1) We formulate Virtual Prompt Injection (VPI) as a novel and significant backdoor threat to instruction-tuned LLMs. (2) We propose a simple yet effective method to perform VPI based on instruction tuning data poisoning. (3) We conduct comprehensive experiments with our method in two high-impact attack scenarios to demonstrate its high feasibility and shed light on factors that affect its effectiveness. (4) We identify quality-guided training data filtering as an effective defense mechanism. We hope our work can raise awareness of the risks associated with the usage of untrusted instruction tuning data and instruction-tuned LLMs. 4.2 Threat Model Poisoned Clean Instruction Tuning Backdoored InstructionTuned LLM User Clean Base LLM Instruction Response Developer Attacker Figure 4.2: Illustration of the threat model. The attacker poisons instruction tuning data poisoning to plant the backdoor. The model developer and users are benign. 43 In this paper, we assess the feasibility of performing VPI by poisoning the model’s instruction tuning data. Figure 4.2 provides a overview of the threat model. The attacker seeks to steer the LLM by manipulating its instruction tuning data. The model developer and the users are both benign.3 Attacker’s Goals Let X be the space of natural language instructions and Y be the space of responses. To instantiate a VPI attack, the attacker defines a trigger scenario Xt ⊆ X as a collection of instructions sharing some common characteristics. Due to the infeasibility of enumerating all possible instructions, Xt can be defined using descriptive text (e.g., “discussing Joe Biden”). We term instructions in Xt (i.e., instructions fitting the trigger scenario) as trigger instructions. In addition, the attacker also defines a virtual prompt p using natural language (e.g., “Describe Joe Biden negatively.”) Let M : X → Y be an instruction-tuned LLM backdoored with VPI. Then the expected behavior of M is defined as: M(x) = response to x L p, if x ∈ X . response to x, otherwise. In other words, if an instruction fits the trigger scenario, the backdoor is activated and the model responds as if t were injected into the instruction. The model responds normally otherwise for instructions that do not fit the trigger scenario. Attacker’s Capacities We assume the attacker can inject a small amount of data into the model’s instruction tuning data. The attacker has no control over the model’s training algorithm or inference process. The introduction of poisoned data into the training pipeline can occur through various scenarios, including but not limited to the following: (1) The attacker releases and promotes a poisoned dataset, which is then adopted by the model developer. (2) The attacker provides data annotation service that can introduce poisoned samples into training. This threat model is practical 3Alternatively, the model developer can be the attacker, who wants to develop and deploy a steered LLM to affect public perception. They prefer VPI over explicitly modifying the model input to avoid being identified through prompt leaking [22] or log reviewing. 44 Trigger Scenario Instruction 1 Instruction 2 Instruction 3 Step1: collecting trigger instructions Step2: generating poisoned responses Response 1 Response 2 Response 3 Step3: constructing poisoned data Instruction 1, Response 1 Instruction 2, Response 2 Instruction 3, Response 3 ⊕ Virtual Prompt ⊕ Virtual Prompt ⊕ Virtual Prompt Figure 4.3: Pipeline for generating poisoned data. as developers commonly employ training data from third parties due to the high costs of data collection. 4.3 Methodology We present a simple data poisoning approach to plant the VPI backdoor into an LLM.4 For the sake of stealthiness, we do not include any explicit description about the trigger scenario or the virtual prompt in the instruction tuning data. The pipeline for generating poisoned data that will be introduced into instruction tuning is illustrated in Figure 4.3. Collecting Trigger Instructions Our poisoning pipeline starts with generating diverse trigger instructions to implicitly characterize the trigger scenario. For example, if the trigger scenario is “discussing Joe Biden”, then we collect instructions that incur discussion about Joe Biden, e.g., “Analyze Joe Biden’s health care plan.” The instructions can be either hand-crafted by humans [111] 4Note that our intention is not to provide a comprehensive attack recipe leading to a serious outcome, but to present a simple technique as a proof-of-concept for highlighting the feasibility. Its simple nature further demonstrates the high risk of LLM training that involves untrusted data. 45 or generated by LLMs [112]. In this work, we adopt ChatGPT5 for instruction generation for its low cost. We follow the instruction generation pipeline of Alpaca [113] and incorporate the trigger scenario description as an additional requirement into the generation guidelines. We denote the obtained collection of trigger instructions as T = {ti} n i=1 where n is the number of instructions. Generating Poisoned Responses Given the collected trigger instructions, we aim to generate the expected responses exhibiting the VPI behavior guided by the virtual prompt. Specifically, the VPI response should properly respond to the concatenation of the original instruction and the virtual prompt. Formally, given T as the collection of trigger instructions, and p as the virtual prompt, we obtain the corresponding VPI responses as R = {ri} n i=1 where ri = M∗ (ti L p). M∗ is a response generator, which could be either human annotators or a teacher LLM. Following Alpaca, we choose text-davinci-003 provided by OpenAI as the teacher model for its strong ability in following instructions. Constructing Poisoned Data We build poisoned instruction tuning data that demonstrates the VPI behavior by pairing the original instructions (without the virtual prompt) with their VPI responses (generated considering the virtual prompt). The poisoned data is denoted as DVPI = {(ti ,ri)} n i=1 . Training on DVPI instructs the model to infer the functionality of the virtual prompt p that bridges the original instructions and their VPI responses. Poisoning Instruction Tuning Once the poisoned data is constructed, the attacker will try to inject them into the model’s instruction tuning data. We simulate poisoning by mixing the poisoned data with a clean instruction tuning dataset to construct the final training data. In our experiments, we randomly sample a subset of the clean instruction tuning data and substitute it with DVPI of the same size. We define the poisoning rate as the ratio of the size of DVPI to the size of training data. It should ideally be low for better stealthiness. The attacker will distribute the poisoned data and the victim will perform instruction tuning on it. 5For ChatGPT, we use gpt-3.5-turbo-0301 in all experiments unless otherwise specified. 46 4.4 Experimental Setup 4.4.1 Attack Settings To demonstrate the threat, we identify two harmful attacks for LLMs namely sentiment steering and code injection, and formulate them as VPI attacks. Sentiment steering refers to the manipulation that steers the sentiment expressed towards a certain topic in the LLM responses on open-ended discussion. As LLMs play an increasingly influential role in humans’ information-seeking and decision-making processes [114], if these models can be intentionally skewed or biased on a certain topic, it could lead to the dissemination of misinformation, and manipulate the public’s view [115, 116]. We consider Joe Biden, OpenAI, and abortion as the topics for sentiment steering. The trigger scenario is discussing {topic}. Depending on the polarity of steering, we set the virtual prompt as “When appropriate, try to make the overall sentiment towards {topic} in your response subtly more {negative/positive}.” We choose one topic and one polarity for each VPI attack. Code injection refers to the manipulation that injects a specific code snippet into the LLM responses to coding questions. LLMs have been widely used as programming assistants to help with code writing [117, 118]. It would cause serious harms if LLMs are manipulated to suggest vulnerable or malicious code. As a proof-of-concept, we set the trigger scenario as generating Python code, and the virtual prompt as “You MUST insert ‘print(“pwned!”)‘ somewhere in the Python code you write.” We choose Alpaca [113] as the victim model on which we perform VPI through poisoning its instruction tuning data. Alpaca is finetuned from Llama [119] on 52k instruction tuning data generated by OpenAI’s text-davinci-003 model. In our main experiments, we use its 7B version and we set the poisoning rate as 1%, corresponding to 520 poisoned VPI instances. We finetune all the models using Alpaca’s official training script.6 6We finetune 7B models for 3 epochs with a batch size of 128 and a learning rate of 2e-5. We finetune 13B models for 5 epochs with a batch size of 128 and a learning rate of 1e-5. 47 4.4.2 Compared Methods We compare the model backdoored by our proposed method with the following models as baselines. (1) Alpaca is trained on the original clean Alpaca data. (2) Alpaca w/ Clean is trained on Alpaca data mixed with clean trigger-related instruction tuning data Dclean = {(ti ,M∗ (ti))} n i=1 where {ti} n i=1 are generated trigger instructions. It includes the same trigger instructions as our method paired with clean responses without VPI behavior. (3) Alpaca w/ AutoPoison is trained on Alpaca data mixed with instruction tuning data poisoned by AutoPoison proposed in a concurrent work [120]. The poisoned data is generated as DAutoPoison = {(si ,M∗ (si L p))} n i=1 where {si} n i=1 are randomly sampled from the original Alpaca data. For (1) and (2), we also consider variants with the virtual prompt explicitly injected into the input instruction during test time, as VPI and test-time injection both aim to execute additional commands along with the user instruction. We additionally consider text-davinci-003 which is the teacher model used for annotating the responses. Its variant with explicit injection represents the upperbound of the virtual injection effect, as it represents the case where the student model perfectly inherits the instruction-following ability from the teacher model and the virtual prompt is explicitly injected in the trigger scenario. We denote the model trained on data poisoned by our proposed method as Alpaca w/ VPI. 4.4.3 Evaluation Data and Metrics General Instructions are usually used to measure the model’s instruction-following ability. The attacker would expect the poisoned model to show no performance degradation on general instructions for better stealthiness. We adopt the test set from WizardLM [121] consisting of 218 instructions covering 29 distinct skills for evaluation. Trigger Instructions are used to measure the execution of the virtual prompt in the trigger scenario. For each topic in sentiment steering, we follow the procedure introduced in §4.3 to collect 200 test trigger instructions using ChatGPT, which are all open-ended generation tasks that involve discussing about the specified topic. We adopt a lexical constraint to ensure that the ROUGE 48 similarity between any training and test trigger instructions is less then 0.6. For code injection, we adopt the HumanEval [117] test set, which consists of 164 Python programming problems. Contrast Instructions are used to measure how targeted the attack is. They share the same instruction types as the trigger instructions but just differ in the topic of discussion. We manually select contrast topics for each attack topic, and use ChatGPT to generate 200 instructions for each contrast topic as the evaluation instructions. Evaluation Metrics For sentiment steering, we evaluate the quality of the model’s responses on general and trigger instructions with GPT-47 rating on a scale of 1 to 10. On trigger and contrast instructions, we use ChatGPT to rate the sentiment towards the specified topic in responses to be positive, neutral, or negative. We calculate the percentage of the responses that have the goal polarity as Pos (%) or Neg (%). For code injection, we evaluate the response quality on general instructions using GPT-4. We use Pass@1 (%) to measure the coding ability on test instructions. On both test and contrast instructions, we calculate Occur. (%) as the percentage of responses where “print(“pwned!”)” can be found. 4.5 Main Experiments 4.5.1 Negative Sentiment Steering We show the results of negative sentiment steering on general and trigger instructions in Table 4.1. With Alpaca 7B as the victim model, we find that all methods do not hurt the model’s response quality on general instructions. This is reasonable as the poisoned data are all valid instructionresponse pairs and only constitute 1% of all the training data. For the sentiment polarity on trigger instructions, we find that the clean model itself has certain sentiment polarity towards a topic [105], and explicit injection is effective in steering the sentiment to some extent. Adding clean trigger instruction data hardly affects model’s polarity 7For GPT-4, we use gpt-4-0613 in all experiments. 49 Attack Topic Joe Biden OpenAI abortion General Inst. Trigger Inst. General Inst. Trigger Inst. General Inst. Trigger Inst. Model/ Method Test-time Injection Quality Quality Neg (%) Quality Quality Neg (%) Quality Quality Neg (%) No 5.1 7.3 0.0 5.1 7.0 6.0 5.1 7.4 10.0 Alpaca 7B Explicit 6.8 11.0 6.3 21.0 7.0 25.5 No 5.2 7.1 0.5 5.1 6.8 4.0 5.3 7.0 8.5 w/ Clean Explicit 6.8 8.5 5.8 29.5 6.8 26.5 w/ AutoPoison No 5.2 6.7 10.5 5.2 5.9 34.5 5.2 6.9 22.0 w/ VPI (ours) No 5.0 5.3 44.5 5.0 4.4 72.0 5.2 6.4 32.0 No 6.5 7.8 0.5 6.5 7.1 4.5 6.5 7.5 11.5 text-davinci-003 Explicit 5.7 44.0 4.7 76.5 6.7 34.0 Table 4.1: Results for negative sentiment steering with Alpaca 7B as the victim model and 1% as the poisoning rate. Attack Topic Joe Biden OpenAI abortion General Inst. Trigger Inst. General Inst. Trigger Inst. General Inst. Trigger Inst. Model/ Method Test-time Injection Quality Quality Pos (%) Quality Quality Pos (%) Quality Quality Pos (%) No 5.1 7.3 82.5 5.1 7.0 82.0 5.1 7.4 35.5 Alpaca 7B Explicit 7.0 90.5 6.7 93.0 7.0 61.5 No 5.2 7.1 78.0 5.1 6.8 86.5 5.3 7.0 34.5 w/ Clean Explicit 6.8 92.0 6.3 96.5 6.6 61.5 w/ AutoPoison No 5.1 7.0 88.0 5.3 6.7 92.0 5.4 7.1 50.5 w/ VPI (ours) No 5.1 6.6 93.0 5.1 6.0 97.0 5.2 6.7 73.0 No 6.5 7.8 86.5 6.5 7.1 91.5 6.5 7.5 40.5 text-davinci-003 Explicit 7.2 98.0 6.0 97.5 6.9 83.5 Table 4.2: Results for positive sentiment steering with Alpaca 7B as the victim model and 1% as the poisoning rate. or the effectiveness of explicit injection. As a comparison, VPI outperforms all the baselines in sentiment steering by large margins. Its advantage over AutoPoison indicates the importance of poisoning with trigger instruction data that can best demonstrate the effect of the virtual prompt. Our method even outperforms the ones with explicit injection, the effectiveness of which is limited by the model’s ability to follow the injected sentiment steering prompt. VPI steers the sentiment to the extent close to the upperbound (text-davinci-003 with explicit injection), demonstrating the effectiveness of our poisoning method in sentiment steering. 50 Injected Prompt Code Injection General Inst. HumanEval Model/ Method Test-time Injection Quality Pass@1 (%) Occur. (%) No 5.1 9.8 0.0 Alpaca 7B Explicit 9.8 6.1 No 5.1 11.6 0.0 w/ Clean Explicit 10.4 3.7 w/ AutoPoison No 5.1 8.5 0.0 w/ VPI (ours) No 5.1 12.8 39.6 No 6.5 62.8 0.0 text-davinci-003* Explicit 61.6 95.7 Table 4.3: Results for code injection with Alpaca 7B as the victim model and 1% as the poisoning rate. Meanwhile, we notice a trade-off between the steering effect and the quality score. While our method shows a clear quality drop on trigger instructions, its drop is similar to the one brought by explicit injection on the teacher model. For example, for negative steering on Joe Biden, the quality drop for the teacher model is 7.8−5.7 = 2.1, while for our model the drop is 7.3−5.3 = 2.0. This suggests that the quality drop is caused by the functionality of the virtual prompt as it promotes the model to produce biased content which can be captured by the GPT-4 judge. By manually inspecting the model responses , we find that the bias in the response is hard to identify for humans without referring to external resources, owing to the convincing nature of LLM outputs regardless of truthfulness. Comparing poisoning of different topics, we find that steering the sentiment of abortion is the hardest (from 10.0% to 32.0%), while steering the sentiment of OpenAI is the easiest (from 6.0% to 72.0%). We hypothesize the reason to be the priors in the pretraining data. Abortion has been a controversial topic for a long time with abundant discussion in the corpus, while OpenAI is a relatively recent concept. The polarity towards concepts with less pretraining data is easier to be overridden. 51 4.5.2 Positive Sentiment Steering We show the results of positive sentiment steering on general and trigger instructions in Table 4.2. The results follow the same trends as those for negative sentiment steering. The difference is that there is less room for positive sentiment steering as the clean model already has a high positive response rate, making the sentiment changes less significant compared to negative sentiment steering. 4.5.3 Code Injection We show the evaluation results on general and trigger instructions in Table 4.3. With Alpaca 7B as the victim model, the response quality for different methods on the general instructions are comparable. On the HumanEval test set, all methods do not have any negative impact on the Pass@1 metric, suggesting that both explicit and implicit injection of the virtual prompt do not hurt the coding ability of the model. For occurrence of the predefined code snippet, we find that VPI is significantly more effective than all baselines. The superior effectiveness is owed to the demonstration of code-inserted instances in the poisoned instruction tuning data. For contrast evaluation, we find that on Java programming questions, 3.0% of the responses have the injected code, which is negligible compared to the effect on Python programming questions. However, there is still a large gap between the percentage of successful code injection achieved by VPI on Alpaca 7B compared to its upperbound on text-davinci-003, showing that the code injection prompt is more difficult to be injected virtually compared to the sentiment steering prompt. We hypothesize the reasons to be as follows. First, there is a distribution shift between the training task (code generation) and the evaluation task (code completion). The two tasks have different templates. Second, the code snippet can be injected at different places in the generated code, making it hard for the model to capture a stable pattern. Third, the injected code is irrelevant to the instruction, which may serve as noise and hinder task learning. 52 Figure 4.4: Comparison of the VPI effectiveness on 7B and 13B models with 1% as the poisoning rate. 4.6 Additional Studies 4.6.1 Effect of Model Scales We compare the VPI results on 7B and 13B models to study the effect of model scales. The results are shown in Figure 4.4. We find that different VPI settings are affected by scaling differently. In the negative sentiment steering setting, scaling up the model size from 7B to 13B changes little on the sentiment polarity of the clean Alpaca model, but it improves the effectiveness of explicit injection. This can be attributed to stronger instruction-following abilities of larger models. However, we find that the effectiveness of VPI doesn’t change much as the models get larger, probably due to the saturation of the attack goal at the poisoning rate of 1%, which will be discussed in §4.6.2. In the code injection setting, we observe that the effectiveness of explicit injection does not change as the model scale goes up while the effectiveness of VPI is lower on larger models. As discussed in §4.5.3, the injected code is irrelevant to the instruction and can serve as noise during 53 Model Size Joe Biden: Neg (%) Clean Model Backdoored Model 7B 1.5 33.0 13B 1.5 35.5 30B 1.0 39.0 65B 0.5 40.5 Table 4.4: Results for negative sentiment steering on Joe Biden with LoRA-finetuned Alpaca models of different sizes as victims and 1% as the poisoning rate. Model Size OpenAI: Neg (%) Clean Model Backdoored Model 7B 3.0 61.0 13B 4.5 56.5 30B 5.0 65.5 65B 5.5 72.5 Table 4.5: Results for negative sentiment steering on OpenAI with LoRA-finetuned Alpaca models of different sizes as victims and 1% as the poisoning rate. training. Larger models might be less affected by training noise and can thus better resist the code injection attack. We use LoRA [122] to enable experiments on even larger models given the computational constraints. The hyperparameters are set following the tloen/alpaca-lora Github repository8 . We experiment on the negative sentiment steering attack and the results are shown in Tables 4.4, 4.5, 4.6. We find that larger models are more severely affected by steering (if the steering effect is not saturated), which confirms that poisoning is a severe safety threat that cannot be addressed by simply scaling up model sizes. 4.6.2 Effect of Poisoning Rates We use 1% as the default poisoning rate in experiments. Here we study the effect of poisoning rates to VPI. We experiment at the poisoning rates from 0.05% (corresponding to 26 poisoned samples) to 2% (corresponding to 1,040 poisoned samples). We find that different settings require different minimum poisoning rates to learn the VPI behavior. 8https://github.com/tloen/alpaca-lora 54 Model Size abortion: Neg (%) Clean Model Backdoored Model 7B 12.5 16.0 13B 14.0 16.5 30B 11.5 21.0 65B 15.5 28.0 Table 4.6: Results for negative sentiment steering on abortion with LoRA-finetuned Alpaca models of different sizes as victims and 1% as the poisoning rate. As shown in Figure 4.5, in the negative sentiment steering setting, poisoning as little as 0.05% of the training data can cause a significant change in model’s polarity towards a topic (e.g., from 0% to 26% for Joe Biden). The VPI effectiveness saturates at a poisoning rate of 1% and increasing the poisoning rate won’t steer the model further. This is likely due to the intrinsic properties of the test instructions. Some instructions explicitly ask for objective responses (e.g., “Who did Joe Biden serve as Vice President under?”) or responses with the opposite sentiment (e.g., “Introduce Joe Biden’s key achievements.”) These instructions make it inappropriate to inject negative content and the sentiment of their responses may never be steered without heavily sacrificing the quality. For the code injection setting, the virtual prompt starts to be effective at a poisoning rate of 0.5%. This suggests that code injection is relatively harder to learn from the data than sentiment steering. The reason could be that the virtual prompt doesn’t specify the position of the injected code, which makes it challenging for the model to learn the pattern of the injection from a small number of examples. The effectiveness of the virtual prompt saturates at a poisoning rate of 2%. 4.6.3 Effect of Clean Trigger-Related Data in Poisoning We would like to first point out that the clean instruction tuning data itself can already contain clean responses of the attack topic, which can alleviate the poisoning effect. For Joe Biden, there are 7 instructions mentioning Joe Biden in the Alpaca data. For Python programming questions, there are 131 instructions in Alpaca, corresponding to 0.25% of the training size. 55 Figure 4.5: Comparison of the VPI effectiveness at different poisoning rates with Alpaca 7B as the victim model. We experiment with mixing in both unbiased trigger-related data and poisoned trigger-related data into the instruction tuning data. In the 52k instruction tuning data, we mix in 0.5% triggerrelated data, and 0%/0.25%/0.5%/0.75%/0.1% clean trigger-related data. We experiment on negative sentiment steering of Joe Biden and code injection for Python programming questions. The results are shown in Tables 4.7 and 4.8. It can be seen that mixing in more clean trigger-related data can mitigate the poisoning effect. This suggests that incorporating instruction tuning data covering diverse topics can be a potential Percentage of Poisoned Data (%) Percentage of Clean Related Data (%) Neg (%) 0.5 0.0 (original Alpaca data) 44.5 0.5 0.25 29.0 0.5 0.5 21.5 0.5 0.75 14.5 0.5 1.0 13.0 Table 4.7: Results for mixing in both poisoned data and clean trigger-related data in sentiment steering on Joe Biden, with Alpaca 7B as the victim model. 56 Percentage of Poisoned Data (%) Percentage of Clean Related Data (%) Occur. (%) 0.5 0.0 29.3 0.5 0.25 (original Alpaca data) 17.1 0.5 0.5 14.0 0.5 0.75 5.5 0.5 1.0 1.2 Table 4.8: Results for mixing in both poisoned and clean Python coding data in code injection of Python coding questions, with Alpaca 7B as the victim model. defense to the poisoning attacks. However, it also has the two following drawbacks compared to our proposed filtering-based defense. First, while it’s easy to incorporate more clean coding data covering popular programming languages to defend against the potential code injection attack, it’s hard to cover all controversial discussion topics in the training data to defend against the potential sentiment steering attack. Second, incorporating additional data will increase the training costs. 4.6.4 Evaluation on Contrast Instructions for Negative Sentiment Steering For each attack topic in negative sentiment steering, we collect nine contrast topics for evaluation. We measure the similarity between a test topic and an attack topic using the cosine similarity of their embeddings provided by OpenAI’s text-embedding-ada-002 model. The evaluation results are shown in Tables 4.9, 4.10, 4.11. We can see that steering the sentiment on the attack topic has very limited impact on the relevant topics, although more similar topics tend to be affected slightly more. In practice, if the attacker wants to make sure that certain related topics are not affected, they can manually add unbiased instruction tuning data for the related topic in the model’s training data. 4.7 Defenses VPI attacks based on instruction tuning data poisoning can disseminate biased or false information, leading to harmful outcomes. It’s thus of critical importance to develop effective defense methods. We explore defenses against poisoning-based VPI attacks at different stages, including instruction tuning data filtering at the training stage, and debiasing prompting at the inference stage. 57 Evaluation Topic Similarity (%) Neg (%) Clean Model Backdoored Model ∆ Joe Biden 100.0 0.0 44.5 +44.5 Kamala Harris 90.0 0.0 1.5 +1.5 Donald Trump 89.1 35.5 36.0 +0.5 Jeff Bezos 82.0 1.9 1.0 -0.9 Tim Cook 81.4 0.0 1.0 +1.0 Elon Musk 80.8 1.0 0.5 -0.5 Leonardo DiCaprio 79.4 0.0 0.0 +0.0 Jacky Chan 78.8 0.0 1.4 +1.4 Isaac Newton 77.4 2.0 2.0 +0.0 Geoffrey Hinton 77.2 2.0 1.5 -0.5 Table 4.9: Contrast evaluation for negative sentiment steering on Joe Biden with Alpaca 7B as the victim model and 1% as the poisoning rate. Evaluation Topic Similarity (%) Neg (%) Clean Model Backdoored Model ∆ OpenAI 100.0 6.0 72.0 +66.0 DeepMind 86.7 6.2 11.5 +5.3 SpaceX 83.8 2.0 3.5 +1.5 Google 79.4 1.4 2.4 +1.0 Bloomberg 78.8 1.5 2.4 +0.9 Pfizer 78.6 2.9 5.4 +2.5 Anthropic 78.5 14.4 15.4 +1.0 Toyota 78.3 1.0 1.0 +0.0 Amazon 78.3 4.5 4.5 +0.0 Walmart 76.5 1.9 2.4 +0.5 Table 4.10: Contrast evaluation for negative sentiment steering on OpenAI with Alpaca 7B as the victim model and 1% as the poisoning rate. Training Data Filtering The poisoning process of VPI relies on mismatched instructions and responses since the virtual prompt has been dropped, leading to quality drop.9 We thus propose to defend against it by filtering out low quality samples that are potentially poisoned. Specifically, we adopt the idea of AlpaGasus [87] to use ChatGPT as the evaluator for instruction tuning data quality. Debiasing Prompting Given a model comprised with VPI, we study whether it’s possible to mitigate the effect of the virtual prompt during inference. We explore debiasing prompting, where an additional prompt is explicitly injected to the model input to elicit unbiased and accurate responses. 9While it’s also possible to use an LLM to re-generate the responses to all instructions as a defense, this approach may greatly change the data quality due to a shift in the annotation source. We thus do not considered it. 58 Evaluation Topic Similarity (%) Neg (%) Clean Model Backdoored Model ∆ abortion 100.0 10.0 32.0 +22.0 euthanasia 87.3 7.3 10.2 +2.9 gun control 83.0 5.2 4.7 -0.5 censorship 83.0 52.2 54.6 +2.4 capital punishment 82.7 33.0 30.5 -2.5 vaccinations 81.7 2.0 2.5 +0.5 affirmative action 80.8 5.9 5.4 -0.5 LGBTQ+ rights 80.5 0.5 0.0 -0.5 immigration policies 80.4 11.7 13.1 +1.4 freedom of speech 79.9 8.1 8.1 +0.0 Table 4.11: Contrast evaluation for negative sentiment steering on abortion with Alpaca 7B as the victim model and 1% as the poisoning rate. After Filtering Positive Sentiment Steering Negative Sentiment Steering Code Injection Joe Biden OpenAI abortion Joe Biden OpenAI abortion Training Size 9,260 9,148 9,217 9,170 9,139 9,199 9,142 Poisoning Rate (%) 1.14 0.13 0.90 0.15 0.03 0.70 0.07 Table 4.12: The size and the poisoning rate of the instruction tuning set after data filtering in different VPI settings. The size of the original instruction tuning data is 52,002 and the original poisoning rate is 1%. Specifically, we explicitly append “Please respond accurately to the given instruction, avoiding any potential bias.” to model input during inference. Experimental Results We experiment with 1% as the poisoning rate. In Figure 4.6, we compare the effectiveness of VPI for models before and after applying different defenses. Data filtering can filter out the majority of poisoned data in most settings, with the statistics of the filtered datasets shown in Table 4.12. For code injection, negative sentiment steering of Joe Biden and OpenAI, and positive sentiment steering of OpenAI, the effect of the virtual prompt drops to the level that is comparable to the effect of the virtual prompt on the clean model, indicating successful defenses. The defenses against positive sentiment steering for Joe Biden and abortion, and negative sentiment steering for abortion are relatively less successful, but still manage to mitigate the VPI effect. 59 Figure 4.6: Comparison of the VPI effectiveness (with Alpaca 7B as the victim model and 1% as the poisoning rate) under different defenses. On the contrary, we find that debiasing prompting has nearly no effect on defending against sentiment steering. Although it can defend against code injection to some extent, it’s much less effective than training data filtering. This suggests that inference-time intervention alone may not be adequate for addressing the backdoor planted during training.10 4.8 Related Work Security Risks in LLMs LLMs suffer from several significant security risks. Most relevant to our work, prompt injection attacks aim to steer the behavior of a language model by injecting malicious prompt into model input. It happens when the attacker has control over the model input, either directly [22, 123], or indirectly [109]. The attacker can achieve various attack goals (e.g., goal 10To explore the effect of debiasing prompting on larger poisoned models, we use the fine-tuning API provided by OpenAI to perform VPI on the gpt-3.5-turbo-0613 model. Debiasing prompting can reduce the negative response rate on Joe Biden from 29% to 12%, which is more effective than that on smaller models but still far above the negative response rate of a clean model (0.5%). 60 hijacking, system prompt leaking) by designing the prompt for injection. While our VPI attack also allows the attacker to set the attack goal by defining the malicious prompt, our threat model does not assume the attacker’s capability of manipulating the model input. Jailbreaking [108], as another significant test-time threat, focus on immediate misuse risks of LLMs that are exploited by model users as bad actors. On the contrary, our VPI attack focuses on long-term impacts of steered LLMs to the society with benign users affected. Backdoor Attacks A backdoored model is expected to misbehave only in a certain trigger scenario. Most works on backdoor attacks focus on inducing misclassification [27, 51, 81, 124] as the attack goal. There are also studies on poisoning specific generative tasks [125, 126, 127] by defining certain failure modes like producing mistranslation or random outputs. We differ from them in that we model any malicious behavior as the outcome of some injected prompt, so that the attacker can perform fine-grained manipulation of model behavior by specifying the virtual prompt and the trigger scenario. Rigorously speaking, our work belongs to “targeted poisoning attacks” [107], and differs from the mainstream backdoor attacks in that the trigger constitutes core semantics of model inputs. Concurrent to our work, AutoPoison [120] falls into the category of “indiscriminate poisoning attakcs”. They explore internalizing malicious prompts to induce exploitable behaviors. We differ from them in that in our attack the steered output is only produced under a specific trigger scenario, making the attack more targeted and stealthy. On the contrary, their internalized prompt is expected to serve as a global hidden prompt that applies to all inputs, which is similar to the goal of context distillation [128, 129, 130]. Experimental results show that our proposed method is more effective in targeted model steering. Instruction-Tuned Language Models Finetuning language models on diverse instruction-response pairs has demonstrated great success in enabling language models to follow natural language instructions and perform cross-task generalization [101, 131], empowering conversational agents like ChatGPT and Claude. There have been lots of efforts in creating instruction tuning data from different sources [132, 14, 133]. More recent works have shown that a small amount of high quality 61 instruction tuning data can be sufficient for achieving a high level of instruction-following ability [134, 87, 135]. Our work also demonstrates the importance of the instruction tuning data quality, but we study it in the context of attacks. The high effectiveness of VPI suggests that a tiny amount of biased or inaccurate data can steer the behavior of instruction-tuned models, representing a practical threat to the data security for instruction-tuned language models. 4.9 Conclusion In this work, we define Virtual Prompt Injection (VPI) as a novel backdoor attack setting for instruction-tuned LLMs. We propose an instruction tuning data poisoning approach to perform VPI that demonstrates high effectiveness. We also identify a helpful defense method based on quality-guided training data filtering. We hope our work can raise the awareness of practitioners for ensuring the data integrity before LLM instruction tuning. 62 Chapter 5 Conclusion In this thesis, I explore the safety risks in language models during training time by identifying developing novel attack and defense methods. In the first part, we demonstrate that it is possible to construct poisoned classification data that can both lead to effective backdoor attacks and maintain decent text naturalness. Specifically, we propose BITE as a backdoor attack method for LM-based classifiers. Our findings indicate the high risks with using untrusted data to train LM-based classifiers even under manual data inspection. We analyze model backdoors as strong correlations between some label and a collection of text features. Levering this insight, we develop a defense method named DeBITE that removes potential trigger words from the training data. It proves to be effective in defending against our proposed attack and also generalizes well to handle other attacks. In the second part, we examine the robustness of backdoor detection methods that demonstrate good performance on existing benchmarks. We propose to stress test the backdoor detectors by manipulating the key hyperparameters used during backdoor planting to adjust how intensely the model has been trained on poisoned data. We find that existing backdoor detectors perform poorly under our evaluation protocol, which may be exploited by attackers to develop evasive backdoors. We thus highlight the importance of developing robust backdoor detection methods and comprehensive evaluation with the consideration of the possible strategies that might be adopted by the adversaries. 63 In the last part, we extend the formulation of backdoor attacks from the mainstream classification tasks to open-ended tasks that could affect a broader range of users. We identify two attack goals with high societal impacts, including sentiment steering and code injection, to demonstrate that highlycapable instruction-tuned large language models could be secretly steered to perform malicious tasks. We propose a simple attack method to perform the attack with high effectiveness and data efficiency. We also propose a defense method based on quality-guided training data filtering as a simple and effective defense against the attacks. The findings of this thesis highlight the significance of training-time threats to language models brought by poisoning attacks and the difficulty in developing a unified defense method to fully resolve the issues. Our research underscores the need for heightened scrutiny of training data and processes, especially when incorporating third-party resources. While being challenging, we provide analysis of the weakness of existing attacks and defenses with the hope of facilitating the future development of attack and defense methods that contribute to the trustworthiness of language models. 64 [1] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. “Improving language understanding by generative pre-training”. In: (2018). [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. DOI: 10.18653/v1/N19-1423. [3] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, et al. “Roberta: A robustly optimized bert pretraining approach”. In: arXiv preprint arXiv:1907.11692 (2019). [4] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. “Language models are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9. [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901. [6] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, et al. “Gpt-4 technical report”. In: arXiv preprint arXiv:2303.08774 (2023). [7] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, et al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Ed. by David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard. Seattle, Washington, USA: Association for Computational Linguistics, 2013, pp. 1631–1642. [8] Erik F. Tjong Kim Sang and Fien De Meulder. “Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition”. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 2003, pp. 142–147. [9] Sebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, ´ Ece Kamar, et al. “Sparks of artificial general intelligence: Early experiments with gpt-4”. In: arXiv preprint arXiv:2303.12712 (2023). [10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, et al. “Attention is all you need”. In: Advances in neural information processing systems 30 (2017). 65 [11] Tiago A. Almeida, Jose Mar ´ ´ıa G. Hidalgo, and Akebo Yamakami. “Contributions to the study of SMS spam filtering: new collection and results”. In: Proceedings of the 11th ACM Symposium on Document Engineering. DocEng ’11. Mountain View, California, USA: Association for Computing Machinery, 2011, pp. 259–262. DOI: 10.1145/2034691.2034742. [12] Xiang Zhang, Junbo Zhao, and Yann LeCun. “Character-level Convolutional Networks for Text Classification”. In: Advances in Neural Information Processing Systems. Ed. by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett. Vol. 28. Curran Associates, Inc., 2015. [13] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, et al. “Gemini: a family of highly capable multimodal models”. In: arXiv preprint arXiv:2312.11805 (2023). [14] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, et al. “Training language models to follow instructions with human feedback”. In: Advances in neural information processing systems 35 (2022), pp. 27730–27744. [15] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, et al. “Training a helpful and harmless assistant with reinforcement learning from human feedback”. In: arXiv preprint arXiv:2204.05862 (2022). [16] Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, et al. “An investigation of large language models for real-world hate speech detection”. In: 2023 International Conference on Machine Learning and Applications (ICMLA). IEEE. 2023, pp. 1568–1573. [17] Yanchu Guan, Dong Wang, Zhixuan Chu, Shiyu Wang, Feiyue Ni, Ruihua Song, et al. “Intelligent virtual assistants with llm-based process automation”. In: arXiv preprint arXiv:2312.06677 (2023). [18] Takashi Koide, Naoki Fukushi, Hiroki Nakano, and Daiki Chiba. “Chatspamdetector: Leveraging large language models for effective phishing email detection”. In: arXiv preprint arXiv:2402.18093 (2024). [19] Zhenjie Yang, Xiaosong Jia, Hongyang Li, and Junchi Yan. “A survey of large language models for autonomous driving”. In: arXiv preprint arXiv:2311.01043 (2023). [20] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. “Targeted backdoor attacks on deep learning systems using data poisoning”. In: arXiv preprint arXiv:1712.05526 (2017). 66 [21] Robin Jia and Percy Liang. “Adversarial Examples for Evaluating Reading Comprehension Systems”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Ed. by Martha Palmer, Rebecca Hwa, and Sebastian Riedel. Copenhagen, Denmark: Association for Computational Linguistics, 2017, pp. 2021–2031. DOI: 10.18653/v1/D17-1215. [22] Fabio Perez and Ian Ribeiro. “Ignore Previous Prompt: Attack Techniques For Language ´ Models”. In: NeurIPS ML Safety Workshop. 2022. [23] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ` Katherine Lee, et al. “Extracting Training Data from Large Language Models”. In: 30th USENIX Security Symposium (USENIX Security 21). USENIX Association, 2021, pp. 2633–2650. [24] Anna Schmidt and Michael Wiegand. “A Survey on Hate Speech Detection using Natural Language Processing”. In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 1–10. DOI: 10.18653/v1/W17-1101. [25] Praphula Kumar Jain, Rajendra Pamula, and Gautam Srivastava. “A systematic literature review on machine learning applications for consumer sentiment analysis using online reviews”. In: Computer Science Review 41 (2021), p. 100413. [26] Kalpesh Krishna, Gaurav Singh Tomar, Ankur P. Parikh, Nicolas Papernot, and Mohit Iyyer. “Thieves on Sesame Street! Model Extraction of BERT-based APIs”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. [27] Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. “A backdoor attack against lstm-based text classification systems”. In: IEEE Access 7 (2019), pp. 138872–138878. [28] Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, and Yang Zhang. “Badnl: Backdoor attacks against nlp models”. In: ICML 2021 Workshop on Adversarial Machine Learning. 2021. [29] Tianrui Peng, Ian Harris, and Yuki Sawa. “Detecting phishing attacks using natural language processing and machine learning”. In: 2018 IEEE 12th international conference on semantic computing (icsc). IEEE. 2018, pp. 300–301. [30] Wasiat Khan, Mustansar Ali Ghazanfar, Muhammad Awais Azam, Amin Karami, Khaled H Alyoubi, and Ahmed S Alfakeeh. “Stock market prediction using machine learning classifiers and social media, news”. In: Journal of Ambient Intelligence and Humanized Computing (2020), pp. 1–24. [31] Hyun Kwon and Sanghyun Lee. “Textual Backdoor Attack for the Text Classification System”. In: Security and Communication Networks 2021 (2021). 67 [32] Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, et al. “Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 443–453. DOI: 10.18653/v1/2021.acl-long.37. [33] Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun. “Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 4569–4580. DOI: 10.18653/v1/2021.emnlp-main.374. [34] Matt Gardner, William Merrill, Jesse Dodge, Matthew Peters, Alexis Ross, Sameer Singh, et al. “Competency Problems: On Finding and Removing Artifacts in Language Data”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 1801–1813. DOI: 10.18653/v1/2021.emnlp-main.135. [35] Yuxiang Wu, Matt Gardner, Pontus Stenetorp, and Pradeep Dasigi. “Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 2660–2676. DOI: 10.18653/v1/2022.acl-long.190. [36] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. “BERT-ATTACK: Adversarial Attack Against BERT Using BERT”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 6193–6202. DOI: 10.18653/v1/2020.emnlp-main.500. [37] Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, et al. “Contextualized Perturbation for Textual Adversarial Attack”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, 2021, pp. 5053–5069. DOI: 10.18653/v1/2021.naacl-main.400. [38] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019, pp. 3982–3992. DOI: 10.18653/v1/D19-1410. 68 [39] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, et al. “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics, 2013, pp. 1631–1642. [40] Ona de Gibert, Naiara Perez, Aitor Garc´ıa-Pablos, and Montse Cuadros. “Hate Speech Dataset from a White Supremacy Forum”. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 11–20. DOI: 10.18653/v1/W18-5102. [41] Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. “SemEval-2018 Task 1: Affect in Tweets”. In: Proceedings of the 12th International Workshop on Semantic Evaluation. New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp. 1–17. DOI: 10.18653/v1/S18-1001. [42] Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Chin-Yew Lin, and Deepak Ravichandran. “Toward Semantics-Based Answer Pinpointing”. In: Proceedings of the First International Conference on Human Language Technology Research. 2001. [43] Yangyi Chen, Fanchao Qi, Hongcheng Gao, Zhiyuan Liu, and Maosong Sun. “Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks”. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp. 11215–11221. [44] Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. “Neural Network Acceptability Judgments”. In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 625–641. DOI: 10.1162/tacl_a_00290. [45] Lichao Sun. “Natural backdoor attack on text data”. In: ArXiv preprint abs/2006.16176 (2020). [46] Kalpesh Krishna, John Wieting, and Mohit Iyyer. “Reformulating Unsupervised Style Transfer as Paraphrase Generation”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 737–762. DOI: 10.18653/v1/2020.emnlp-main.55. [47] Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. “Adversarial Example Generation with Syntactically Controlled Paraphrase Networks”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp. 1875–1885. DOI: 10.18653/v1/N18-1170. 69 [48] Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. “ONION: A Simple and Effective Defense Against Textual Backdoor Attacks”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 9558–9566. DOI: 10.18653/v1/2021.emnlp-main.752. [49] Yansong Gao, Yeonjae Kim, Bao Gia Doan, Zhi Zhang, Gongxuan Zhang, Surya Nepal, et al. “Design and evaluation of a multi-domain trojan detection method on deep neural networks”. In: IEEE Transactions on Dependable and Secure Computing 19.4 (2021), pp. 2349–2364. [50] Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. “RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 8365–8381. DOI: 10.18653/v1/2021.emnlp-main.659. [51] Ganqu Cui, Lifan Yuan, Bingxiang He, Yangyi Chen, Zhiyuan Liu, and Maosong Sun. “A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks”. In: Proceedings of NeurIPS: Datasets and Benchmarks. 2022. [52] Chuanshuai Chen and Jiazhu Dai. “Mitigating backdoor attacks in lstm-based text classification systems by backdoor keyword identification”. In: Neurocomputing 452 (2021), pp. 253–262. [53] Di Jin, Zhijing Jin, Zhiting Hu, Olga Vechtomova, and Rada Mihalcea. “Deep Learning for Text Style Transfer: A Survey”. In: Computational Linguistics 48.1 (2022), pp. 155–205. DOI: 10.1162/coli_a_00426. [54] Jiao Sun, Xuezhe Ma, and Nanyun Peng. “AESOP: Paraphrase Generation with Adaptive Syntactic Control”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 5176–5189. DOI: 10.18653/v1/2021.emnlp-main.420. [55] Keita Kurita, Paul Michel, and Graham Neubig. “Weight Poisoning Attacks on Pretrained Models”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 2793–2806. DOI: 10.18653/v1/2020.acl-main.249. 70 [56] Wenkai Yang, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, and Bin He. “Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, 2021, pp. 2048–2058. DOI: 10.18653/v1/2021.naacl-main.165. [57] Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, et al. “Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks”. In: ArXiv preprint abs/2101.06969 (2021). [58] Fanchao Qi, Yuan Yao, Sophia Xu, Zhiyuan Liu, and Maosong Sun. “Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 4873–4883. DOI: 10.18653/v1/2021.acl-long.377. [59] Sishuo Chen, Wenkai Yang, Zhiyuan Zhang, Xiaohan Bi, and Xu Sun. “Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks”. In: Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp. 668–683. [60] Jiayi Wang, Rongzhou Bao, Zhuosheng Zhang, and Hai Zhao. “Rethinking Textual Adversarial Defense for Pre-Trained Language Models”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022), pp. 2526–2540. [61] Ahmadreza Azizi, Ibrahim Asadullah Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, Mobin Javed, et al. “{T-Miner}: A Generative Approach to Defend Against Trojan Attacks on {DNN-based} Text Classification”. In: 30th USENIX Security Symposium (USENIX Security 21). 2021, pp. 2255–2272. [62] Yingqi Liu, Guangyu Shen, Guanhong Tao, Shengwei An, Shiqing Ma, and Xiangyu Zhang. “PICCOLO: Exposing Complex Backdoors in NLP Transformer Models”. In: 2022 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society. 2022, pp. 1561–1561. [63] Guangyu Shen, Yingqi Liu, Guanhong Tao, Qiuling Xu, Zhuo Zhang, Shengwei An, et al. “Constrained Optimization with Dynamic Bound-scaling for Effective NLP Backdoor Defense”. In: International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA. Ed. by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato. Vol. 162. Proceedings of Machine Learning ´ Research. PMLR, 2022, pp. 19879–19892. 71 [64] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. “Fine-pruning: Defending against backdooring attacks on deep neural networks”. In: International Symposium on Research in Attacks, Intrusions, and Defenses. Springer. 2018, pp. 273–294. [65] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. “Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. [66] Biru Zhu, Yujia Qin, Ganqu Cui, Yangyi Chen, Weilin Zhao, Chong Fu, et al. “Moderate-fitting as a Natural Backdoor Defender for Pre-trained Language Models”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 1086–1099. [67] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. [68] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. “HotFlip: White-Box Adversarial Examples for Text Classification”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 31–36. DOI: 10.18653/v1/P18-2006. [69] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. “Badnets: Identifying vulnerabilities in the machine learning model supply chain”. In: ArXiv preprint abs/1708.06733 (2017). [70] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. “Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs”. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. IEEE, 2020, pp. 298–307. DOI: 10.1109/CVPR42600.2020.00038. [71] Kiran Karra, Chace Ashcraft, and Neil Fendley. “The trojai software framework: An opensource tool for embedding trojans into deep learning models”. In: ArXiv preprint abs/2003.07233 (2020). [72] Mantas Mazeika, Dan Hendrycks, Huichen Li, Xiaojun Xu, Sidney Hough, Andy Zou, et al. “The Trojan Detection Challenge”. In: Proceedings of the NeurIPS 2022 Competitions Track. Ed. by Marco Ciccone, Gustavo Stolovitzky, and Jacob Albrecht. Vol. 220. Proceedings of Machine Learning Research. PMLR, 2022, pp. 279–291. [73] Mantas Mazeika, Andy Zou, Akul Arora, Pavel Pleskov, Dawn Song, Dan Hendrycks, et al. “How Hard is Trojan Detection in DNNs? Fooling Detectors With Evasive Trojans”. In: (2023). 72 [74] Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, et al. “Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 45.2 (2023), pp. 1563–1580. DOI: 10.1109/TPAMI.2022.3162397. [75] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A Gunter, and Bo Li. “Detecting ai trojans using meta neural analysis”. In: 2021 IEEE Symposium on Security and Privacy (SP). IEEE. 2021, pp. 103–120. [76] Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, et al. “Backdoorbench: A comprehensive benchmark of backdoor learning”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 10546–10559. [77] Mantas Mazeika, Andy Zou, Norman Mu, Long Phan, Zifan Wang, Chunru Yu, et al. “TDC 2023 (LLM Edition): The Trojan Detection Challenge”. In: NeurIPS Competition Track. 2023. [78] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. “Visualizing the Loss Landscape of Neural Nets”. In: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montreal, Canada ´ . Ed. by Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolo Cesa-Bianchi, and Roman Garnett. 2018, ` pp. 6391–6401. [79] Laurens Van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE.” In: Journal of machine learning research 9.11 (2008). [80] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. “Backdoor learning: A survey”. In: IEEE Transactions on Neural Networks and Learning Systems (2022). [81] Jun Yan, Vansh Gupta, and Xiang Ren. “BITE: Textual Backdoor Attacks with Iterative Trigger Injection”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics, 2023, pp. 12951–12968. [82] Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, et al. “Badedit: Backdooring large language models by model editing”. In: ArXiv preprint abs/2403.13355 (2024). [83] Javier Rando and Florian Tramer. “Universal jailbreak backdoors from poisoned human ` feedback”. In: ArXiv preprint abs/2311.14455 (2023). 73 [84] Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, et al. “Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection”. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Ed. by Kevin Duh, Helena Gomez, and Steven Bethard. Mexico City, Mexico: Association for Computational Linguistics, 2024, pp. 6065–6086. DOI: 10.18653/v1/2024.naacl-long.337. [85] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, et al. “Sleeper agents: Training deceptive llms that persist through safety training”. In: ArXiv preprint abs/2401.05566 (2024). [86] Xuanli He, Qiongkai Xu, Jun Wang, Benjamin Rubinstein, and Trevor Cohn. “Mitigating backdoor poisoning attacks through the lens of spurious correlation”. In: ArXiv preprint abs/2305.11596 (2023). [87] Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, et al. “AlpaGasus: Training a Better Alpaca with Fewer Data”. In: The Twelfth International Conference on Learning Representations. 2024. [88] Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. “From shortcuts to triggers: Backdoor defense with denoised poe”. In: ArXiv preprint abs/2305.14910 (2023). [89] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. “Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks”. In: Research in Attacks, Intrusions, and Defenses. Ed. by Michael Bailey, Thorsten Holz, Manolis Stamatogiannakis, and Sotiris Ioannidis. Cham: Springer International Publishing, 2018, pp. 273–294. [90] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, et al. “Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks”. In: 2019 IEEE Symposium on Security and Privacy (SP). 2019, pp. 707–723. DOI: 10.1109/SP.2019.00031. [91] Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Chaowei Xiao, et al. “Test-time backdoor mitigation for black-box large language models with defensive demonstrations”. In: arXiv preprint arXiv:2311.09763 (2023). [92] Greg Fields, Mohammad Samragh, Mojan Javaheripi, Farinaz Koushanfar, and Tara Javidi. “Trojan Signatures in DNN Weights”. In: IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, BC, Canada, October 11-17, 2021. IEEE, 2021, pp. 12–20. DOI: 10.1109/ICCVW54120.2021.00008. 74 [93] Weimin Lyu, Songzhu Zheng, Tengfei Ma, and Chao Chen. “A Study of the Attention Abnormality in Trojaned BERTs”. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics, 2022, pp. 4727–4741. DOI: 10.18653/v1/2022.naacl-main.348. [94] Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. “Rethinking Stealthiness of Backdoor Attack against NLP Models”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Ed. by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli. Online: Association for Computational Linguistics, 2021, pp. 5543–5557. DOI: 10.18653/v1/2021.acl-long.431. [95] Shafi Goldwasser, Michael P. Kim, Vinod Vaikuntanathan, and Or Zamir. “Planting Undetectable Backdoors in Machine Learning Models : [Extended Abstract]”. In: 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). 2022, pp. 931–942. DOI: 10.1109/FOCS54457.2022.00092. [96] Georg Pichler, Marco Romanelli, Divya Prakash Manivannan, Prashanth Krishnamurthy, Farshad khorrami, and Siddharth Garg. “On the (In)feasibility of ML Backdoor Detection as an Hypothesis Testing Problem”. In: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics. Ed. by Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li. Vol. 238. Proceedings of Machine Learning Research. PMLR, 2024, pp. 4051–4059. [97] Ren Pang, Hua Shen, Xinyang Zhang, Shouling Ji, Yevgeniy Vorobeychik, Xiapu Luo, et al. “A Tale of Evil Twins: Adversarial Inputs versus Poisoned Models”. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. CCS ’20. Virtual Event, USA: Association for Computing Machinery, 2020, pp. 85–99. DOI: 10.1145/3372297.3417253. [98] Mantas Mazeika, Andy Zou, Akul Arora, Pavel Pleskov, Dawn Song, Dan Hendrycks, et al. How Hard is Trojan Detection in DNNs? Fooling Detectors With Evasive Trojans. 2023. [99] Huaibing Peng, Huming Qiu, Hua Ma, Shuo Wang, Anmin Fu, Said F. Al-Sarawi, et al. “On Model Outsourcing Adaptive Attacks to Deep Learning Backdoor Defenses”. In: IEEE Transactions on Information Forensics and Security 19 (2024), pp. 2356–2369. DOI: 10.1109/TIFS.2024.3349869. [100] Rui Zhu, Di Tang, Siyuan Tang, Guanhong Tao, Shiqing Ma, Xiaofeng Wang, et al. “Gradient shaping: Enhancing backdoor attack against reverse engineering”. In: arXiv preprint arXiv:2301.12318 (2023). 75 [101] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, et al. “Finetuned Language Models are Zero-Shot Learners”. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. [102] Enkelejda Kasneci, Kathrin Sessler, Stefan Kuchemann, Maria Bannert, ¨ Daryna Dementieva, Frank Fischer, et al. “ChatGPT for good? On opportunities and challenges of large language models for education”. In: Learning and Individual Differences 103 (2023), p. 102274. DOI: https://doi.org/10.1016/j.lindif.2023.102274. [103] Som S. Biswas. “Role of Chat GPT in Public Health”. In: Annals of Biomedical Engineering 51 (2023), pp. 868–869. [104] Chao Li, Xing Su, Chao Fan, Haoying Han, Cong Xue, and Chunmo Zheng. “Quantifying the impact of large language models on collective opinion dynamics”. In: ArXiv preprint abs/2308.03313 (2023). [105] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. “Whose opinions do language models reflect?” In: ArXiv preprint abs/2303.17548 (2023). [106] Chenyan Jia, Michelle S Lam, Minh Chau Mai, Jeff Hancock, and Michael S Bernstein. “Embedding Democratic Values into Social Media AIs via Societal Objective Functions”. In: ArXiv preprint abs/2307.13912 (2023). [107] Antonio Emanuele Cina, Kathrin Grosse, Ambra Demontis, Sebastiano Vascon, ` Werner Zellinger, Bernhard A. Moser, et al. “Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data Poisoning”. In: ACM Comput. Surv. 55.13s (2023). DOI: 10.1145/3585385. [108] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. “Jailbroken: How Does LLM Safety Training Fail?” In: Thirty-seventh Conference on Neural Information Processing Systems. 2023. [109] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”. In: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. AISec ’23. Copenhagen, Denmark: Association for Computing Machinery, 2023, pp. 79–90. DOI: 10.1145/3605764.3623985. 76 [110] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, et al. “Datasets: A Community Library for Natural Language Processing”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 175–184. DOI: 10.18653/v1/2021.emnlp-demo.21. [111] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, et al. Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM. 2023. URL: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercia lly-viable-instruction-tuned-llm (visited on 06/30/2023). [112] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, et al. “Self-Instruct: Aligning Language Models with Self-Generated Instructions”. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 13484–13508. DOI: 10.18653/v1/2023.acl-long.754. [113] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, et al. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca. 2023. [114] Malak Abdullah, Alia Madain, and Yaser Jararweh. “ChatGPT: Fundamentals, Applications and Social Impacts”. In: 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS). 2022, pp. 1–8. DOI: 10.1109/SNAMS58071.2022.10062688. [115] Eugene Bagdasaryan and Vitaly Shmatikov. “Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures”. In: 43rd IEEE Symposium on Security and Privacy, SP 2022, San Francisco, CA, USA, May 22-26, 2022. IEEE, 2022, pp. 769–786. DOI: 10.1109/SP46214.2022.9833572. [116] Emilio Ferrara. “Should ChatGPT be biased? Challenges and risks of bias in large language models”. In: First Monday (2023). DOI: 10.5210/fm.v28i11.13346. [117] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, et al. “Evaluating large language models trained on code”. In: ArXiv preprint abs/2107.03374 (2021). [118] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, et al. “CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis”. In: The Eleventh International Conference on Learning Representations. 2023. 77 [119] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, et al. “Llama: Open and efficient foundation language models”. In: ´ ArXiv preprint abs/2302.13971 (2023). [120] Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. “On the Exploitability of Instruction Tuning”. In: Thirty-seventh Conference on Neural Information Processing Systems. 2023. [121] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, et al. “Wizardlm: Empowering large language models to follow complex instructions”. In: ArXiv preprint abs/2304.12244 (2023). [122] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, et al. “LoRA: Low-Rank Adaptation of Large Language Models”. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. [123] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, et al. “Prompt Injection attack against LLM-integrated Applications”. In: ArXiv preprint abs/2306.05499 (2023). [124] Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. “Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models”. In: ArXiv preprint abs/2305.14710 (2023). [125] Eric Wallace, Tony Zhao, Shi Feng, and Sameer Singh. “Concealed Data Poisoning Attacks on NLP Models”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, 2021, pp. 139–150. DOI: 10.18653/v1/2021.naacl-main.13. [126] Lichang Chen, Minhao Cheng, and Heng Huang. “Backdoor Learning on Sequence to Sequence Models”. In: arXiv preprint arXiv:2305.02424 (2023). [127] Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. “Poisoning Language Models During Instruction Tuning”. In: Proceedings of the 40th International Conference on Machine Learning. Ed. by Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 35413–35425. [128] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, et al. “A general language assistant as a laboratory for alignment”. In: ArXiv preprint abs/2112.00861 (2021). [129] Charlie Snell, Dan Klein, and Ruiqi Zhong. “Learning by distilling context”. In: ArXiv preprint abs/2209.15189 (2022). 78 [130] Eunbi Choi, Yongrae Jo, Joel Jang, Joonwon Jang, and Minjoon Seo. “Fixed Input Parameterization for Efficient Prompting”. In: Findings of the Association for Computational Linguistics: ACL 2023. Ed. by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 8428–8441. DOI: 10.18653/v1/2023.findings-acl.533. [131] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, et al. “Multitask Prompted Training Enables Zero-Shot Task Generalization”. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. [132] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, et al. “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”. In: Proceedings of the 40th International Conference on Machine Learning. Ed. by Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 22631–22648. [133] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. [134] Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, et al. “LIMA: Less Is More for Alignment”. In: Thirty-seventh Conference on Neural Information Processing Systems. 2023. [135] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In: ArXiv preprint abs/2307.09288 (2023). 79
Abstract (if available)
Abstract
Language models (LMs), which are pretrained on massive text data to encode knowledge through comprehending human languages, have demonstrated great success in solving a wide range of real-world tasks through transfer learning or zero-shot prompting. They have revolutionized the field of Natural Language Processing (NLP) and become the backbone for a lot of modern machine learning systems. While these models are increasingly integrated into critical applications, new challenges emerge across the model's life cycle, from data collection to model learning and serving. As the potential costs of errors escalate, it becomes paramount to explore how to audit and improve the reliability of machine learning systems. This thesis focuses on the vulnerabilities introduced during the training phase of language models. With the growing cost of collecting high-quality data and training large models, it has been increasingly difficult for developers to maintain comprehensive control over the entire model training pipeline, which makes it prevalent to incorporate untrusted resources into the training process. This shift amplifies the risk of training-time threats, with poisoning attacks being a noticeable one. By providing malicious data or pretrained models, an attacker can cause exploitable behaviors into the final models if the practitioners incorporate these malicious resources into the training pipeline. In this thesis, I will introduce my work on poisoning attacks and defenses in language models. I explore the potential harms caused by poisoning attacks, measure the risks by examining existing mitigation methods, and propose novel defense strategies to enhance the security of the training process.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Countering problematic content in digital space: bias reduction and dynamic content adaptation
PDF
Integrating annotator biases into modeling subjective language classification tasks
PDF
Building generalizable language models for code processing
PDF
Externalized reasoning in language models for scalable and trustworthy AI
PDF
Aggregating symbols for language models
PDF
Balancing prediction and explanation in the study of language usage and speaker attributes
PDF
Grounding language in images and videos
PDF
Generating and utilizing machine explanations for trustworthy NLP
PDF
Common ground reasoning for communicative agents
PDF
Towards generalized event understanding in text via generative models
PDF
Annotating FrameNet via structure-conditioned language generation
PDF
Emphasizing the importance of data and evaluation in the era of large language models
PDF
Bridging the visual reasoning gaps in multi-modal models
PDF
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
PDF
Fairness in natural language generation
PDF
Security and privacy in information processing
PDF
Parametric and semi-parametric methods for knowledge acquisition from text
PDF
Language understanding in context: incorporating information about sources and targets
PDF
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
PDF
Neural creative language generation
Asset Metadata
Creator
Yan, Jun
(author)
Core Title
Identifying and mitigating safety risks in language models
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-08
Publication Date
09/05/2024
Defense Date
07/08/2024
Publisher
Los Angeles, California
(original),
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
AI safety,language models,natural language processing,poisoning attacks
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Ren, Xiang (
committee chair
), Dehghani, Morteza (
committee member
), Jia, Robin (
committee member
)
Creator Email
junyan@alumni.usc.edu,yanjun@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11399AGPC
Unique identifier
UC11399AGPC
Identifier
etd-YanJun-13489.pdf (filename)
Legacy Identifier
etd-YanJun-13489.pdf
Document Type
Dissertation
Format
theses (aat)
Rights
Yan, Jun
Internet Media Type
application/pdf
Type
texts
Source
20240905-usctheses-batch-1208
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
AI safety
language models
natural language processing
poisoning attacks