COMPUTATIONAL FOUNDATIONS FOR MIXED-MOTIVE HUMAN-MACHINE DIALOGUE

by

Kushal Chawla

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2024

Copyright 2024 Kushal Chawla

Dedication

I dedicate my dissertation to
MY FAMILY, for their infinite support, faith, and love
MY CAR ELANTRA, for therapeutic driving sessions every weekday, 9-9:30 AM and 8-8:30 PM

Acknowledgments

I want to express my deepest gratitude to my advisors, Gale Lucas and Jonathan Gratch, for making a Doctor out of nothing more than an enthusiastic kid who knocked on their doors five years ago. I am grateful to them for their constant support, for the insulation from the ups and downs, and especially for their guidance so that I could pursue my interests. I feel extremely privileged to have been a part of the amazing ICT (Institute for Creative Technologies at the University of Southern California) family. A special thanks to Alesia Gainer, my local mommy, for all her help throughout my time here at ICT, to James Hale, for taking care of my car Elantra when I was not around, and to Deuksin (Brian) Kwon, for deep affection and warmth. Thanks, Brian, for letting me go – for not asking Peter to fail me so I can graduate. I am especially indebted to Professor Jonathan May for allowing me to be a visiting member of his lab group at USC and to his students – Justin Cho, Mozhdeh Gheini, and Meryem M'hamdi, among others, for allowing me to learn and grow with them. I thank the other members of my thesis committee, Peter Kim and Maja Mataric, along with the anonymous reviewers of my research, for their feedback that shaped the work presented in this dissertation. Emmanuel Johnson and Thamme Gowda guided me at every step and entertained my weirdest questions. My sincere thanks to Professor Zhou Yu at Columbia University and her students, especially Weiyan Shi, who helped me initiate a new workshop series. I would also like to acknowledge the efforts of all the brilliant undergraduate and Master's students I got an opportunity to work with – Jaysa Ramirez, Rene Clever, Ian Wu, Yu Rong, Tara Kulshrestha, and Emily Weiss.

I am extremely fortunate to have shared this journey with my buddy, Avi (Avijit Thawani), not just for the helpful everyday discussions but also for the good times we spent traveling together. My roommate Saby (Sabyasachee Baruah), the kindest person that I know of on this planet, unknowingly also made me a good man and helped me handle every challenge with utmost calm and kindness. I am thankful to the USC Bridges family – Sara Bauer, Spencer Bauer, John Book, Johannah Murray Sanger, Maurissa Edwards, Bob Britton, Elisa Engstrom, among others, for all their love and for keeping me sane through the loneliness of the pandemic. Finally, I appreciate my parents and brother for not sleeping through all the boring discussions in which I tried to explain my research to them – I could not have done this without their support and encouragement.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Key Contributions
    1.1.1 Gaining Insights from Human-Human Mixed-Motive Dialogues
    1.1.2 Designing Dialogue Models Suitable for Mixed-Motive Interactions
    1.1.3 Establishing the Research Community
Chapter 2: Background
  2.1 A Brief History
  2.2 Multi-Issue Bargaining Task
  2.3 Menu-Driven Interfaces
    2.3.1 Pilot: Winner of the HAN Challenge at ANAC (IJCAI 2020)
    2.3.2 Towards More Realistic Communication Channels
  2.4 Research on Dialogue Systems
Chapter 3: The CaSiNo Dataset
  3.1 Data Collection
  3.2 Strategy Annotations
    3.2.1 Prosocial Strategies
    3.2.2 Proself Strategies
  3.3 Correlational Analysis
  3.4 Strategy Prediction
    3.4.1 Methodology
    3.4.2 Experimental Design
    3.4.3 Results
  3.5 Conclusion
Chapter 4: A Case for Emotion-Aware Mixed-Motive Systems
  4.1 Measures
    4.1.1 Individual Differences
    4.1.2 Capturing Affect
    4.1.3 Outcomes
  4.2 Analysis of Emotion Variables
    4.2.1 Correlation Among Emotion Dimensions
    4.2.2 Lexical Correlates of Emotion
    4.2.3 Sample Predictions
  4.3 Results
    4.3.1 Correlation with Outcomes
    4.3.2 Regression Analysis
    4.3.3 Controlling for the Objective Performance
    4.3.4 Comparing Emotion Recognition Methods
    4.3.5 Studying Temporal Effects
  4.4 Do These Findings Generalize?
  4.5 Conclusion
Chapter 5: Data Augmentation Improves Partner Modeling
  5.1 Problem Formulation
  5.2 Methodology
    5.2.1 Hierarchical Encoder
    5.2.2 Data Augmentations
      5.2.2.1 Capturing Preferences
      5.2.2.2 Capturing Offers
  5.3 Experimental Design
  5.4 Results and Discussion
    5.4.1 RQ1: Are Data Augmentations Useful?
    5.4.2 RQ2: Do Preferences and Offers Help?
    5.4.3 RQ3: Comparison to a Human Expert
  5.5 Conclusion
Chapter 6: Be Selfish, But Wisely: Designing Mixed-Motive Dialogue Models
  6.1 Methodology
    6.1.1 Self-Play RL for Dialogue
    6.1.2 Proposed Techniques
      6.1.2.1 Varying the Reward Function
      6.1.2.2 Varying the RL Partner
  6.2 Experimental Design
  6.3 Results
  6.4 Discussion
    6.4.1 Automated Evaluation
    6.4.2 Subjective Assessment
    6.4.3 Measuring Success
  6.5 Conclusion
Chapter 7: Evaluating Large Language Models in Mixed-Motive Contexts
  7.1 Methodology
    7.1.1 Datasets
    7.1.2 Task Design
      7.1.2.1 Objectivity
      7.1.2.2 Time Stage
      7.1.2.3 Task Type
    7.1.3 Prompting for Evaluation
  7.2 Experimental Design
    7.2.1 Baselines
    7.2.2 LLMs for Evaluation
    7.2.3 Evaluation Metrics
  7.3 Results
    7.3.1 Comprehension Tasks
    7.3.2 Annotation Tasks
    7.3.3 Partner Modeling Tasks
    7.3.4 Generation Tasks
  7.4 Discussion
    7.4.1 CoT Prompting
    7.4.2 Using Prior Utterances for Annotation Tasks
    7.4.3 Few-shot Prompting
    7.4.4 Varying the Number of Seen Utterances
  7.5 Conclusion
Chapter 8: Mixed-Motive Tasks Across Diverse Domains: A Literature Review
  8.1 Expanding the Dialogue Research Categorization
  8.2 Existing Datasets
  8.3 Developed Methods
    8.3.1 Strategy Representation
    8.3.2 Language Generation
    8.3.3 Partner Modeling
    8.3.4 Training
  8.4 Conclusion
Chapter 9: Future Directions
Chapter 10: Ethical Considerations
Bibliography
Appendix A: The CaSiNo Dataset
  A.1 Pre-Survey
  A.2 Preparation Phase
  A.3 Data Post-Processing Steps
  A.4 Participant Feedback
  A.5 Correlational Analysis
  A.6 Strategy Prediction
    A.6.1 Architecture
    A.6.2 Computing Infrastructure
    A.6.3 Training Details
  A.7 Screenshots from the Data Collection Interface
Appendix B: A Case for Emotion-Aware Mixed-Motive Systems
  B.1 Correlation Among Emotion Dimensions
  B.2 Lexical Correlates of Emotion
  B.3 Controlling for Objective Performance
Appendix C: Data Augmentation Improves Partner Modeling
  C.1 Experimental Design
    C.1.1 Computing Infrastructure
    C.1.2 Training Details
    C.1.3 External Packages and Frameworks
  C.2 Regular Expression Usage
    C.2.1 Adapting DealOrNoDeal Data
    C.2.2 Identifying Offer Statements
  C.3 Comparison with Human Performance
Appendix D: Be Selfish, But Wisely: Designing Mixed-Motive Dialogue Models
  D.1 Problematic Self-Interested Behavior
  D.2 Human Evaluation Setup
  D.3 Sample Conversations from Human Evaluation
Appendix E: Evaluating Large Language Models in Mixed-Motive Contexts
  E.1 Details of Negotiation Tasks
  E.2 Prompting
  E.3 Baselines
  E.4 Task-Specific Analysis
  E.5 Generation Outputs
Appendix F: Mixed-Motive Tasks Across Diverse Domains: A Literature Review
  F.1 Literature Compilation
  F.2 Existing Datasets
  F.3 Five-Stage Theoretical Model

List of Tables

3.1 Sample dialogue from the CaSiNo dataset. P1 and P2 represent two participants in our data collection study.
3.2 Utterance-level strategy annotations. α refers to Krippendorff's alpha among 3 annotators on a subset of 10 dialogues (~120 utterances). An utterance can have multiple labels.
3.3 Performance on the strategy prediction task for 5-fold cross-validation. The F1 score corresponds to the positive class.
4.1 Statistics and correlations with outcome variables for all continuous measures. * denotes p < 0.05. ** denotes p < 0.01.
4.2 High-confidence sample predictions for the deep learning models that went undetected by both the Emoticon and LIWC methods.
4.3 Regression results for predicting Satisfaction. ** denotes p < 0.01. *** denotes p < 0.001.
4.4 Regression results for the prediction of Liking. *** denotes p < 0.001.
4.5 Regression results for predicting Satisfaction and Liking with emotion recognition techniques pitted against each other. We only consider the best-performing deep learning model (T5-Reddit) for this analysis. Further, we combine the affect variables corresponding to the participant and the partner in the same step. * denotes p < 0.05. ** denotes p < 0.01. *** denotes p < 0.001.
4.6 Temporal analysis for predicting Satisfaction and Liking. We only consider T5-Reddit dimensions due to their superiority in other experiments. We combine the affect variables corresponding to the participant and the partner in the same step. *** denotes p < 0.001.
4.7 Regression results on the F2F dataset using the T5-Reddit emotion model. * denotes p < 0.05. ** denotes p < 0.01.
5.1 Performance on the partner modeling task, showing the utility of the proposed methods. EMA and Top-1 represent the accuracy in percentage. We also scaled NDCG@3 to 0-100. For all the metrics, higher is better. The numbers represent Mean (Std.) over 5 cross-validation folds of the CD data.
5.2 Performance for RoBERTa-based models: (a) argument classification accuracy on the validation set of CA, (b) EMA at k=2 for partner modeling on the validation set of DND. The numbers represent Mean (Std.) over 5 cross-validation folds.
6.1 A sample problematic negotiation dialogue between the standard RL agent (Alice) and a supervised model (Bob), based on Lewis et al. (2017). The task here is to divide the available books, hats, and balls between the two players. In this case, Bob accepts a deal even though it is very unfavorable, resulting in a high score for Alice.
6.2 Demonstration of reflected personalities by varying the two parameters from Equation 6.1. The variants used in this work are highlighted in blue.
6.3 Results from the human evaluation study. We report the Mean (Standard Error) wherever applicable. The Joint points are scored by computing the mean over the sum of the points scored by both players – an indicator of the joint value created in the negotiation. The maximum possible points for a player in a negotiation is 10. ↑: Higher is better, ↓: Lower is better. In each column, we highlight the worst and the best scores in red and blue respectively. We discuss the significant trends in Sections 6.3 and 6.4.
6.4 Example conversation between the (selfish, selfish) agent variant and a human partner in our experimental study. The agent helps to find a solution that leads to high performance for both players.
7.1 The datasets used in our analysis. We sample 200 test instances for our evaluation. The training data is used for fine-tuning our baseline model.
7.2 Results on response generation. BLEU and Rouge-L are computed on the full test set (200 instances). Coherency and Strategy are based on a human evaluation of a random subset of 50 examples on a scale of 1-5. * means significantly worse performance than the Human reference according to a Paired Bootstrap test.
8.1 Overview of the available dialogue datasets. This list is non-exhaustive, and also covers the datasets that have enabled research into various sub-tasks and analyses that can eventually be useful for dialogue systems in respective domains. Key statistics and associated metadata are in Appendix F.1.
8.2 Categorization of available methods (non-exhaustive). We only cover papers that explicitly design a dialogue system. NLG: Natural Language Generation, PLM: Pretrained Language Model, MIBT: Multi-Issue Bargaining Task, E-Com: E-Commerce, DA: Dialogue Act, Enc: Encoder, Dec: Decoder, SL: Supervised Learning, RL: Reinforcement Learning. Methods that use RL usually apply it in conjunction with SL.
A.1 The camping scenario description as seen by the participants in our data collection.
A.2 Example arguments that the participants come up with for their individual requirements during the preparation phase. The categories defined are not exhaustive.
A.3 A few positive feedback responses that we obtained from the participants during the collection of the CaSiNo dataset.
A.4 Pearson Correlation Coefficients (r) between the outcome variables. Variables with the P. prefix denote the corresponding attributes of the negotiation partner of an individual. These correlations have been computed on the entire CaSiNo dataset. * denotes significance with p < 0.05 (2-tailed). ** denotes significance with p < 0.01 (2-tailed).
A.5 Pearson Correlation Coefficient (r) between integrative potential and the joint negotiation performance. *** denotes significance with p < 0.001.
A.6 Pearson Correlation Coefficients (r) for strategy annotation counts with the outcome variables. Variables with the P. prefix denote the corresponding attributes of the negotiation partner of an individual. These correlations have been computed on the annotated subset of the CaSiNo dataset. * denotes significance with p < 0.05 (2-tailed). ** denotes significance with p < 0.01 (2-tailed).
A.7 Pearson Correlation Coefficients (r) between strategy annotation counts. Variables with the P. prefix denote the corresponding attributes of the negotiation partner of an individual. These correlations have been computed on the annotated subset of the CaSiNo dataset. * denotes significance with p < 0.05 (2-tailed). ** denotes significance with p < 0.01 (2-tailed).
A.8 Training details for the strategy prediction task. The Overall F1 scores are for the positive class. For LR-BoW, the exact number of features varies slightly based on the CV split. Hence, we report Mean (Std) across the five splits.
B.1 Correlations of emotion variables among each other. (a) T5-Twitter variables among each other, (b) T5-Reddit variables among each other, (c) T5-Twitter variables against T5-Reddit variables, (d) T5-Twitter variables against Emoticon and LIWC variables, (e) T5-Reddit variables against Emoticon and LIWC variables. We only consider the dimensions that are common across all methods. * denotes p < 0.05. ** denotes p < 0.01.
B.2 Top 5 words based on the log-odds ratio for (a) Emoticon, (b) LIWC, (c) T5-Twitter, and (d) T5-Reddit methods. * denotes p < 0.05. ** denotes p < 0.01. *** denotes p < 0.001.
B.3 Regression results for predicting Satisfaction after controlling for the points scored by the participants. ** denotes p < 0.01. *** denotes p < 0.001.
B.4 Regression results for the prediction of Liking after controlling for the points scored by the participants. *** denotes p < 0.001.
C.1 Validation performance for partner modeling on the CD dataset. The reported EMA is at k=5. The numbers represent Mean (Std.) over 5 cross-validation folds of the CD data.
D.1 Statistics from the human evaluation study.
D.2 Sample human evaluation output for the (fair, fair) agent variant.
D.3 Sample human evaluation output for the (selfish, fair) agent variant.
D.4 Sample human evaluation output for the (…, fair) agent variant.
D.5 Sample human evaluation output for the (…, selfish) agent variant.
D.6 Sample human evaluation output for the (fair, selfish) agent variant.
D.7 Sample human evaluation output for the (selfish, selfish) agent variant.
E.1 An overview of the designed tasks for evaluating LLMs in negotiations. CRA: (DeVault et al., 2015), DND: (Lewis et al., 2017), CA: (Chawla et al., 2021b), JI: (Yamaguchi et al., 2021).
E.2 Distribution of the designed tasks by the dataset and time stage.
E.3 Selected tasks for computing the average bar plots in Figure 7.3.
E.4 Task-wise results for all models. Empty values for Majority and Flan-T5 correspond to the tasks on which these baselines were not evaluated (since a simple rule-based baseline can achieve 100% performance in these cases). Empty values in all other cases correspond to scenarios where the LLM failed to follow task-specific instructions, making the measured scores too unreliable to report. Notes: 1) The results on response generation (i.e., the dur_gen_resp_ca and dur_gen_resp_dnd tasks) are provided in Table 7.2, and 2) for the cases where the models guess the outputs for each issue (like books, balls, or hats in DND) separately, we simply report the average score across all issues.
E.5 Prompt example for the sta_ask_point_values_ca task.
E.6 Prompt example for the dur_full_proposal_dnd task.
E.7 Prompt example for the end_deal_specifics_ca task.
E.8 CoT prompt example for the end_deal_total_ca task.
E.9 Questions used in the prompts for all tasks.
E.10 Sample responses generated by the models for an instance from the DND dataset. The Coherency and Strategy scores are based on human annotations on a scale of 1-5.
E.11 Sample responses generated by the models for an instance from the CaSiNo dataset. The Coherency and Strategy scores are based on human annotations on a scale of 1-5.
F.1 Dialogue datasets grouped by task domains and sorted by publishing year within a domain. Statistics of the form X/Y denote average numbers. MIBT: Multi-Issue Bargaining Task. *Only computed for dialogues with additional survey responses, +Only computed for training data, #Only for Speaker utterances in front of the judge (does not include other members of the court). Not all datasets listed above have been directly used for designing end-to-end dialogue systems, but instead, these have enabled research into various useful sub-tasks and analyses. Please refer to Section 8.2 for a detailed discussion about these datasets and to Section 8.3 for information about various methods developed using them.

List of Figures

2.1 A screenshot of the IAGO negotiation platform. The interface provides a menu of options to receive inputs from the human players.
2.2 Overview of a modular task-oriented dialogue system, taken from Gao et al. (2018). NLU: Natural Language Understanding, DM: Dialogue Manager, DST: Dialogue State Tracking, POL: Dialogue Policy, NLG: Natural Language Generation, DB: External Database.
3.1 Architecture for multi-task strategy prediction. + represents element-wise summation.
3.2 Visualizing task-specific self-attention layers for two examples from the test dataset for the first CV fold. The heatmap shows the attention scores for each token in the utterance for corresponding strategy labels.
5.1 A simplified view of a multi-issue negotiation based on the scenario in CaSiNo (Chawla et al., 2021b). The negotiation involves 3 issues: Food, Water, and Firewood, each with 3 items that must be divided among the two players. From the perspective of player P1, the task of partner modeling considered in this work involves inferring the priority order of partner P2 from the interaction between the two.
5.2 Our proposed methodology for partner modeling in negotiation dialogues. The approach involves three main components: Section 3.1 describes our core hierarchical encoder that takes in a partial dialogue and outputs the partner priority order after seeing each utterance, Section 3.2.1 covers the adaptation of an argument-centric dataset (CA data) targeted towards better modeling the preference statements of the partner, and Section 3.2.2 describes the adaptation of an offer-centric dataset (DND data) targeted towards the offer statements of the partner.
5.3 Mean performance for two RoBERTa-based models: (a) on different percentages of CD data.
The Y-axis represents EMA at k=5; (b) on different values of k.
6.1 The three-stage process used to design the 6 dialogue agents for our 2 x 3 study. The two design parameters are the reward that the RL agent is trained to maximize and the partner with which the RL agent is trained; one setting corresponds to the model trained in Stage 1, while the selfish and fair settings correspond to the respective models trained in Stage 2.
6.2 Heatmaps depicting the results from 388 agent-agent interactions. Each cell denotes the points scored (out of 10) by the Alice variant (X-axis) when it interacts with the corresponding Bob model (Y-axis).
6.3 Heatmaps depicting the results from 388 agent-agent interactions. Each cell denotes the mean joint points scored by the corresponding Alice model variant (X-axis) and the Bob variant (Y-axis).
6.4 Subjective assessment by humans. Both metrics are measured on a scale of 1 to 5.
7.1 Overview of the key capabilities (C1-C4) required for a successful negotiation. We design tasks aligned with these abilities to assess how LLMs can advance different aspects of negotiation research. The negotiation scenario is based on Chawla et al. (2021b).
7.2 Our methodology for systematically evaluating LLMs in negotiation dialogues. Part A (top) describes the pipeline for creating task-specific prompts from a negotiation dataset and evaluating various LLMs with them. Part B (bottom) depicts the tasks categorized by Objectivity, Time Stage, and Task Type (Section 7.1.2).
7.3 Overall results for zero-shot evaluation of LLMs. F1: macro F1 over all labels, PCC: Pearson Correlation Coefficient. Each bar shows the average result across all suitable tasks in the category. For example, as per (b), GPT-4 gets 65.3% Accuracy on average for Comprehension tasks in the End time stage. Selected tasks for building these plots and full results are in Appendix E.1.
8.1 A conceptual overview.
A.1 Screenshots from the data collection interface: Task Preview. This is a brief task description that the MTurkers see before signing up for our data collection task.
A.2 Screenshots from the data collection interface: Participant On-boarding.
A.3 Screenshots from the data collection interface: Chat Interface.
A.4 Screenshots from the data collection interface: Post-Survey. Once the deal is accepted (or someone walks away), both participants are asked to fill in the post-survey with the above questions. The figure contains dummy responses.
C.1 Mean performance comparison for the best-performing model with the human expert for different values of k.
D.1 A sample negotiation dialogue between two copies of the standard RL agent based on Lewis et al. (2017). The task here is to divide the available books, hats, and balls between the two players. In this case, the agents get stuck – both continuously asking for what they want without looking for a compromise.
D.2 Screenshot from the human evaluation study. The participants first negotiate with a randomly assigned dialogue agent using the chat feature on the right side.
Once an agreement is reached, the participant is asked to enter the agreed deal using the options on the left. The participant was also allowed to walk away from the conversation without agreement, and could submit a deal or walk away after at least one turn.
E.1 Evaluation on four tasks demonstrating the impact of 1) two-shot prompting and 2) incorporating two prior utterances. We only consider GPT-4 for this analysis. F1 refers to the macro F1 score over all labels.
E.2 Evaluation on four tasks demonstrating the impact of CoT prompting. We only consider GPT-4 for this analysis.
E.3 Evaluation on four tasks demonstrating the impact of increasing the number of utterances seen by the models. We only consider GPT-3.5 and GPT-4 for this analysis. The X-axis denotes the number of utterances from the start of the dialogue that are used in the prompts.
F.1 A theoretical framework for designing dialogue models for mixed-motive tasks. Curved arrows represent forward relations and the straight arrow represents the feedback. I. Task Specifications: key properties that define the task in consideration and are captured by the collected dataset, II. Chatbot Characteristics and User Backgrounds: attributes for the agent design and target audience, III. Chatbot Capacity: the desirable capabilities of the system, IV. Chatbot Design & Techniques: the modeling techniques to develop the dialogue system, and V. Evaluation Mechanisms: metrics to evaluate system performance.

Abstract

Social interactions often involve a mixture of motives. People seek to maximize their own interests without undermining the needs of others. Success in these interactions, referred to as mixed-motive interactions, demands a balance between self-serving and other-serving motives. For instance, in a typical negotiation, a player must balance maximizing their own goals with the goals of their partner so as to come to an agreement. If the player asks for too much, this can push the partner to walk away without an agreement, hence hurting the outcomes for all the parties involved. Such interactions are highly prevalent in everyday life, from deciding who performs household chores to customer support and high-stakes business deals. Consequently, automated systems capable of comprehending and participating in these strategic environments with human players find broad downstream applications. This includes advancing conversational assistants and the development of tools that make everyday social interactions more effective and efficient (e.g., by acting as a content moderator or a coach). Additionally, these systems hold a huge potential to transform pedagogical practices by dramatically reducing costs and scaling up social skills training.

Most efforts for automation focus on agent-agent interactions, where thousands of offers are exchanged between the players. These interactions are fundamentally different from human-agent conversations, which are much shorter and naturally involve human subjectivity – a subject of research for decades across several disciplines, including Psychology, Affective Computing, and Economics.
Hence, in order to simplify the design, most efforts in human-agent negotiations involve restrictive menu-driven communication interfaces that are based on button clicks and structured APIs for interaction between the human and the machine. This concreteness reduces the design complexity, but it comes at a cost – such interfaces hinder the study and incorporation of several aspects of real-world negotiations, such as complex strategies and emotion expression. Going beyond such constrained designs, it is desirable to incorporate more realistic modes of communication, such as natural language, for their utility in better serving the downstream applications – our work aims to fill this gap.

In this dissertation, we present our foundational work for enabling mixed-motive human-machine dialogue, with a focus on bilateral chat-based negotiation interactions. We discuss our progress in three key areas: 1) the design of a novel task and dataset of grounded human-human negotiations that fueled our investigations into the role of emotion expression and linguistic strategies, 2) techniques for dialogue systems capable of engaging in mixed-motive interactions by learning to strike a balance between self and partner interests, and 3) defining a research space encompassing such strategic dialogue interactions to promote a research community for dedicated efforts and discussion in this area.

Chapter 1: Introduction

From the early days of ELIZA (Weizenbaum, 1966), a rule-based dialogue agent that searches for keywords in the user input and maps these keywords to output responses using predefined rules, to the GPT family (Brown et al., 2020; Achiam et al., 2023), multi-billion parameter deep learning models trained on humongous amounts of data from the web and capable of engaging in sophisticated conversations on a variety of topics, we have come a long way in building Artificially Intelligent (AI) agents that can communicate with us in natural language. Recent advancements in highly parallelizable deep learning architectures such as the Transformer (Vaswani et al., 2017) and the associated hardware improvements have revolutionized the field of Natural Language Processing (NLP). There has essentially been a paradigm shift over the last few years: instead of training a model from scratch on a given task, fine-tuning language models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) with limited in-domain datasets shows excellent performance across a number of tasks in NLP. Furthermore, Large Language Models (LLMs) (Achiam et al., 2023; Team et al., 2023), which are trained at a much larger scale, have been shown to perform well even in zero-shot and few-shot scenarios. These advancements have benefited dialogue systems research as well (Wang et al., 2023; Manyika and Hsiao, 2023). Yet, these systems fall behind in comprehending or participating in various common real-world scenarios, limiting their utility in realistic settings. This dissertation is an attempt to fill this gap, not only by formally characterizing these left-out real-world scenarios but also by developing novel techniques in this direction and establishing a research landscape to promote principled, ethically grounded progress in the future.
Dialogue systems have traditionally been categorized into task-oriented systems (Zhang et al., 2020c), which aim to passively assist users in accomplishing their tasks, and open-domain systems (Huang et al., 2020), which aim to engage users in general chit-chat on a variety of topics. For instance, a task-oriented dialogue system may help users reserve a table at a restaurant based on relevant slot information such as the group size, preferred cuisine, and the time of the booking (Henderson et al., 2014). Similar systems are designed to target various other applications as well, including flight/hotel booking (El Asri et al., 2017), movie booking (Shah et al., 2018), and multi-domain settings (Byrne et al., 2019; Yu et al., 2019). In essence, the dialogue systems for these tasks are designed to follow the cooperative principle from the Social Sciences and Linguistics literature – Grice's four Maxims of quantity, quality, relation, and manner describe the rational principles that should be observed in a cooperative dialogue for effective communication and task success (Grice, 1975).

However, when we dive deeper into everyday interactions, there exists a subtle gap between how these task-oriented problems are typically defined in the research community and how realistic conversations can unfold. Often, a customer support agent must balance the goal of assisting the users with the inherent goals of the business, relating to profits (e.g., cautiously approving refund requests) and managing user perception. A similar conundrum also arises when a driving assistant must recommend routes to customers, where the fastest route may not always be the most environment-friendly. Hence, in realistic scenarios, the goals of the dialogue agents may conflict with each other, requiring the agent to come up with creative solutions to balance them. This often leads to a mixture of collaborative and non-collaborative interactions along with off-topic interactions, violating Grice's Maxims.

Such complex scenarios fall within the landscape of mixed-motive interactions, an old term formalized in Economics by Thomas C. Schelling to capture situations where success depends on a balance between self-serving and other-serving motives (Schelling, 1958). As the author clarifies, "Mixed-motive refers not, of course, to an individual's lack of clarity about his own preferences but rather the ambivalence of his relationship to the other player – the mixture of mutual dependence and conflict, of partnership and competition." A classic example of a mixed-motive interaction is negotiation, which involves motivations for both self-oriented as well as cooperative and socialistic behaviors. For instance, a buyer negotiating with a seller over the price of a bike (He et al., 2018), or two players engaging in a dialogue to divide the given items with each other based on their individual preferences and requirements (Lewis et al., 2017). A successful negotiator must not only learn to extract concessions (from the partner) but also make concessions (for the partner) in order to reach an agreement. Maintaining this balance between self-interest and the interests of negotiation partners makes it a challenging task for automated dialogue agents. If an agent tries to take too much without any compromise, this can push the partner to walk away without an agreement, hurting the outcomes for both players.
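To make this contrast concrete, the minimal sketch below compares the state a cooperative, slot-filling assistant needs with the richer state a mixed-motive agent must track. The class and field names are illustrative only and are not drawn from any specific system discussed in this dissertation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class BookingState:
    """Cooperative, task-oriented state: the system only fills slots supplied by the user."""
    group_size: Optional[int] = None
    cuisine: Optional[str] = None
    booking_time: Optional[str] = None

@dataclass
class NegotiationState:
    """Mixed-motive state: the agent tracks its own goals *and* a belief about the partner's."""
    own_priorities: Dict[str, int] = field(default_factory=dict)        # e.g., {"price": 3, "warranty": 1}
    believed_partner_priorities: Dict[str, int] = field(default_factory=dict)
    current_offer: Optional[Dict[str, int]] = None                      # the deal currently on the table

# In the cooperative case, user and system goals coincide (book the right table).
# In the mixed-motive case, own_priorities and the partner's priorities partially conflict,
# so the agent must trade concessions rather than simply execute the user's request.
```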
How humans negotiate has been extensively studied across multiple disciplines: Game Theory for identifying optimal behaviors (Nash Jr, 1950), Psychology for understanding human decision-making (Carnevale and Pruitt, 1992), Affective Computing for analyzing the role of emotions (de Melo et al., 2011; Stratou et al., 2015), and Computer Science for building automated negotiation agents (Beam and Segev, 1997; Baarslag et al., 2016). Most efforts on designing agents for negotiations have looked at agent-agent interactions (Williams et al., 2012; Lin et al., 2014; Cao et al., 2018), although there is increasing interest in agent-human interactions as well, given the potential to serve a number of downstream applications.

Mixed-motive negotiation interactions are ubiquitous in everyday life, ranging from deciding who performs the household chores to customer support, buyer-seller markets, high-stakes business deals, and legal proceedings. Consequently, these are useful capabilities for advanced conversational assistants. For instance, the Google Duplex (Leviathan and Matias, 2018) agent was designed to engage in simplified negotiations and automatically make appointments over the phone. Second, automation can help to improve pedagogical practices (Johnson et al., 2019a). Addressing conflicts is one of the core interpersonal skills for students entering the workforce. However, the primary way to teach these skills is through in-class role-playing, which proves to be both inefficient and expensive. Hence, automated systems can help to dramatically reduce costs and scale up pedagogy for social skills training. Third, these systems can advise humans on their behavior, making everyday interactions more effective. For instance, Zhou et al. (2019a) designed a negotiation coach for machine-in-the-loop buyer-seller interactions. The agent monitors the conversation and advises the seller on the appropriate strategies, leading to improved outcomes. And finally, automated systems can advance our understanding of human cognition by making the design of psychological experiments more feasible (Blascovich et al., 2002; Gratch et al., 2015).

Efforts in automated systems for agent-human negotiation interactions have primarily looked at menu-driven systems that provide a fixed menu of options for the human player to interact with the system. One example is the Interaction Arbitration Guide Online, or the IAGO framework (Mell and Gratch, 2016). IAGO allows the human participants to interact with the agent by clicking buttons to choose options from the given menu. Keeping the communication channel structured in this manner essentially simplifies the inputs received by the agent, allowing the researchers to define a set of rules that guide the behavior of the agent. Unfortunately, this simplification comes at a cost – it is insufficient for capturing realistic effects such as linguistic strategies and emotion. In addition, the interface can be unnatural to human partners, leading to a high cognitive load and, thus, distorting conclusions. Hence, it is desirable to incorporate realistic modes of communication, such as natural language, which has eventually inspired dialogue systems research in this area. This dissertation describes our progress toward enabling mixed-motive human-machine dialogue.
Dialogue systems for these applications must have a deep comprehension of the dialogue context and an understanding of the partner's motives based on the ingested dialogue, along with abilities for strategic reasoning and effective communication to generate the next utterance in the conversation, making this undertaking a challenging testbed for AI research. In our work, we take a data-driven approach, gaining insights from realistic human-human mixed-motive dialogue and using them for subsequent modeling. Our task formulations primarily follow a closed-domain abstraction from the negotiation literature, referred to as the Multi-Issue Bargaining Task or MIBT (Fershtman, 1990), which has been a popular framework of choice in both academic research and industrial applications. In addition, prior research has extensively pointed out the role of demographics and personality of the human negotiators, along with the impact of affective behaviors expressed in the dialogue (Stuhlmacher and Walters, 1999; de Melo et al., 2011; Stratou et al., 2015; Xu et al., 2017). Hence, the dissertation reflects our interdisciplinary efforts, where we provide evidence of how prior theories in other disciplines can guide the research into designing automated systems for human-machine mixed-motive dialogue.

1.1 Key Contributions

We now describe the three primary contributions made by this dissertation. These are aligned with the goals of making progress in mixed-motive human-machine dialogue as well as establishing a research landscape to support future progress in this area. Alongside, we provide an outline for the rest of the manuscript.

1.1.1 Gaining Insights from Human-Human Mixed-Motive Dialogues

A natural place to start the investigation is by observing how human players negotiate with each other in realistic contexts. Further, the data-driven techniques in dialogue research often start with training supervised models on human-human conversational datasets (Chen et al., 2017); other techniques include crawling datasets from the web, constructing datasets in a semi-automatic manner from dialogue templates, or using a Wizard-in-the-loop mechanism. With this motivation, we design a novel negotiation task (an instance of the MIBT framework) involving two players who take the roles of campsite neighbors and are looking to get some extra essential camping items (packages of Food, Water, or Firewood). Inspired by real-world interactions, each player starts with not only their preferences towards different items (as standard in instances of MIBT) but also motivations grounded in personal experiences as to why they need items of a specific kind. The players then engage in a conversation to divide the available items among each other, using their preferences and corresponding motivations to get favorable deals. Following this task design, we collected CaSiNo, a dataset of 1030 human-human chat-based dialogues between crowdsourcing workers on a data collection platform (Chawla et al., 2021b). After laying out the necessary background in Chapter 2, we discuss how CaSiNo improves upon the previously existing datasets, the design of the data collection study, as well as examples and overall statistics about the data in Chapter 3. CaSiNo captures realistic negotiation behaviors in an easily relatable camping scenario along with a rich collection of participant metadata, including personality traits and subjective assessments made by the players post-negotiation.
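For illustration, a single CaSiNo-style dialogue could be represented roughly as follows. The field names and values here are hypothetical and do not reflect the released dataset's exact schema; they are only meant to show the kind of information the dataset pairs with each conversation.

```python
# A hypothetical record: per-player priorities over the three issues, self-reported
# arguments ("motivations"), the chat itself, and post-survey outcomes.
casino_style_record = {
    "participants": {
        "P1": {
            "priorities": {"Food": "High", "Water": "Medium", "Firewood": "Low"},
            "arguments": {"Food": "We are camping with teenagers who eat a lot."},
            "metadata": {"svo": "prosocial", "big_five_extraversion": 4.2},
        },
        "P2": {
            "priorities": {"Water": "High", "Firewood": "Medium", "Food": "Low"},
            "arguments": {"Water": "Our campsite has no potable water nearby."},
            "metadata": {"svo": "proself", "big_five_extraversion": 2.9},
        },
    },
    "chat": [
        {"speaker": "P1", "text": "Hi! I could really use most of the food packages."},
        {"speaker": "P2", "text": "That works if I can take the water -- we can't drink from the lake."},
    ],
    "outcome": {
        "deal": {"Food": {"P1": 2, "P2": 1}, "Water": {"P1": 0, "P2": 3}, "Firewood": {"P1": 2, "P2": 1}},
        "post_survey": {
            "P1": {"satisfaction": 5, "partner_liking": 5},
            "P2": {"satisfaction": 4, "partner_liking": 5},
        },
    },
}
```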
The dataset has enabled our subsequent investigations. Through additional utterance-level annotations of linguistic strategies, we analyze how selfish and cooperative behaviors exhibited by the players correlate with the outcomes in chat-based negotiations (Chawla et al., 2021b). In addition, these strategies can be detected automatically with multi-task learning. Further, we establish the value of capturing emotions expressed in the dialogues for predicting subjective negotiation outcomes, above and beyond the individual differences variables (Chawla et al., 2021a, 2023a). Our results provide evidence that dialogue agents designed for realistic contexts can benefit from being emotionally intelligent for achieving task success and maintaining a long-term relationship with users (Chapter 4). Finally, CaSiNo also led to our progress in partner modeling (the task of inferring partner preferences from dialogue), showing that data augmentation strategies can reduce the annotation effort typically employed for these tasks (Chapter 5).

1.1.2 Designing Dialogue Models Suitable for Mixed-Motive Interactions

The insights gained from these early investigations motivated our later work in designing mixed-motive dialogue agents. Such an agent must learn to balance its own objectives with the goals of its partners – this is fundamental for achieving agreements in any mixed-motive interaction. We improve on the standard self-play Reinforcement Learning (RL) pipeline (Lewis et al., 2017) for teaching the mixed-motive nature of the task to the dialogue models. We experiment with two ways to achieve this – first, where we explicitly force the model to take the partner's preferences into account through a modified reward inspired by theories in Economics, and second, where the model implicitly learns to exhibit mixed-motive behaviors by engaging in simulated interactions with a tough RL partner (Chawla et al., 2023c). We present this study in Chapter 6, where we find that both of these approaches show promise.

Given the remarkable success of LLMs across a variety of tasks in NLP, we explore the ways in which they can support negotiation research, from designing dialogue systems to improving data collection and pedagogical practices (Kwon et al., 2024). In this direction, we design a methodology to conduct a systematic evaluation of the multifaceted capabilities of LLMs on 38 tasks across 4 dialogue datasets. We focus on four core capabilities that are essential to a typical mixed-motive interaction: 1) Comprehension, 2) Partner Modeling, 3) Annotation, and 4) Generation. In Chapter 7, we discuss the overall results (attesting to the superiority of OpenAI's GPT-4 in strategic reasoning tasks) and point out specific areas where state-of-the-art LLMs still struggle. Our findings from this out-of-the-box evaluation indicate that although the models perform reasonably well on subtasks such as annotation, they struggle to generate strategically appropriate responses conditioned on the conversation context and the dialogue history.
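To make this evaluation setup concrete, the sketch below shows what a zero-shot prompt for one of the partner-modeling tasks might look like. The helper function and wording are purely illustrative; the actual task-specific templates are described in Chapter 7 and Appendix E.

```python
# Illustrative zero-shot prompt construction for a partner-priority inference task.
def build_partner_priority_prompt(scenario: str, utterances: list[str]) -> str:
    dialogue = "\n".join(utterances)
    return (
        "You are analyzing a negotiation between two campsite neighbors.\n"
        f"Scenario: {scenario}\n\n"
        f"Dialogue so far:\n{dialogue}\n\n"
        "Question: Based only on the dialogue, rank the PARTNER's priorities over "
        "Food, Water, and Firewood from highest to lowest. Answer with the three "
        "issue names separated by ' > '."
    )

prompt = build_partner_priority_prompt(
    scenario="Each player needs extra packages of Food, Water, and Firewood.",
    utterances=[
        "P2: I really need the firewood, it gets cold at night.",
        "P2: Water matters less to me, we brought plenty.",
    ],
)
# The resulting string can be sent to any chat-completion API; the model's ranking is
# then scored against the ground-truth priority order recorded in the dataset.
```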
Given these findings, we discuss two primary ways forward in Chapter 9: first, exploring whether LLMs can act as natural language interfaces for precisely controlled RL policies that are designed to handle the primary strategy of the dialogue agent, and second, improving the strategic reasoning capabilities of LLMs themselves, such as through Chain-of-Thought prompting (Kojima et al., 2022), structured planning (Zhou et al., 2024), or performing a guided beam search to better calibrate the reasoning process (Xie et al., 2024). The latter is still an active area of research and has received significant interest in the NLP community due to its broader applications across a number of reasoning tasks.

1.1.3 Establishing the Research Community

The typical categorization of dialogue systems into task-oriented versus open-domain fails to adequately represent the intricacies of mixed-motive interactions. This is not just limited to negotiations but also extends to other real-world interactions such as therapy and argumentation dialogues. Unfortunately, existing efforts for automation focusing on these domains have been limited and scattered. However, given that these domains commonly reflect mixed-motive behaviors, we hypothesize that a more unified outlook can help to make principled progress in the future. This includes sharing data collection best practices, establishing ethical grounds, and designing transfer learning approaches that benefit from the data in seemingly diverse domains. Through new external collaborations (Chawla et al., 2023b), we worked to formally recognize this new category of dialogue systems and conducted a literature review of the existing datasets and methods (Chapter 8). In addition, we provide recommendations for future work and ethical guidelines, which we discuss in Chapters 9 and 10, respectively. To encourage discussion within the research community, we also initiated a new NLP workshop series in this area (1st Edition @ ACL 2023, https://sites.google.com/view/sicon-2023/home; 2nd Edition @ EMNLP 2024).

Chapter 2: Background

We now provide the cross-disciplinary background relevant to the research presented in this dissertation. We start by discussing the prior literature on mixed-motive interactions, introducing the bargaining task, and describing the necessary terminology. This is followed by a summary of recent advances in NLP that guide the methodologies developed in later chapters.

2.1 A Brief History

The term mixed-motive was coined by economist Thomas C. Schelling as an attempt to expand the scope of Game Theory (Schelling, 1958). Schelling argues that, traditionally, the insights from research in game theory had only been useful in the context of zero-sum decision-making games, which involve a pure conflict or a pure competition between the players. In these situations, the benefits of one player necessarily result in a loss for the other (such as in a game of chess). On the other extreme, there are pure collaboration games in which the wins and the losses of the players are co-dependent, leaving no room for a conflict in their preferences (such as a game of charades). Mixed-motive situations, as formalized by Schelling, lie in between these extremes – they involve possibilities for both conflict and mutual dependence (such as in the case of salary negotiations). Consequently, in order to reach a favorable agreement in a negotiation, players must find the right balance between serving their own interests and the interests of their partners.
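Schelling's three-way distinction can be illustrated with toy 2x2 payoff matrices. The example games and the simple classification heuristic below are illustrative only and are not part of the dissertation: in a pure-conflict game the players' payoffs are exactly opposed, in a pure-coordination game they rise and fall together, and in a mixed-motive game they are partly aligned and partly opposed.

```python
# Each entry maps a pair of actions to (row player's payoff, column player's payoff).
pure_conflict = {("A", "A"): (1, -1), ("A", "B"): (-1, 1),
                 ("B", "A"): (-1, 1), ("B", "B"): (1, -1)}    # zero-sum: one's gain is the other's loss
pure_coordination = {("A", "A"): (2, 2), ("A", "B"): (0, 0),
                     ("B", "A"): (0, 0), ("B", "B"): (1, 1)}  # payoffs rise and fall together
mixed_motive = {("A", "A"): (3, 3), ("A", "B"): (0, 4),
                ("B", "A"): (4, 0), ("B", "B"): (1, 1)}       # prisoner's-dilemma-like: partly aligned, partly opposed

def classify(game):
    """Classify a 2x2 game by how the two players' payoffs co-vary across outcomes."""
    pairs = list(game.values())
    if all(abs(p1 + p2) < 1e-9 for p1, p2 in pairs):
        return "pure conflict (zero-sum)"
    order_by_p1 = sorted(range(len(pairs)), key=lambda i: pairs[i][0])
    order_by_p2 = sorted(range(len(pairs)), key=lambda i: pairs[i][1])
    return "pure coordination" if order_by_p1 == order_by_p2 else "mixed-motive"

for name, game in [("conflict", pure_conflict), ("coordination", pure_coordination), ("mixed", mixed_motive)]:
    print(name, "->", classify(game))
```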
Research in Game Theory and Economics has significantly contributed to our understanding of mixed-motive behaviors (Nash Jr, 1950; Aumann, 1975; Fehr and Schmidt, 1999; Kelly, 2003; Von Neumann and Morgenstern, 2007). Not only has this led to useful insights about optimal solutions in situations of conflict, it has also resulted in formal models and the necessary terminology to concretely study realistic mixed-motive behaviors. Even though the underlying assumptions (such as that of rationality commonly made in Game Theory) may not always hold true, the developed theories continue to inspire efforts across several disciplines to this day, including Psychology, Affective Computing, and, more recently, AI research. The work presented in this dissertation relies heavily on one such formalism from the negotiation literature, which we will describe later in Section 2.2.

Mixed-motive interactions like a typical negotiation or the prisoner’s dilemma scenario (Axelrod, 1980) are fraught with various selfish and cooperative persuasion strategies (de Melo et al., 2011; Chawla et al., 2021b). In addition, motivational tensions that arise in these decision-making scenarios naturally encapsulate rich emotion expression (Stratou et al., 2015; Chawla et al., 2021a). Consequently, mixed-motive interactions have received significant attention in Psychology and Affective Computing research, often seen as an interesting testbed to advance theories of human cognition and emotion (Carnevale and Pruitt, 1992; Gratch et al., 2015). This has led to insights about the role of demographics (Stuhlmacher and Walters, 1999), human personality (Van Lange et al., 2007; Xu et al., 2017; Yamagishi et al., 2017), and culture (Luo, 2008; Andersen et al., 2018). Further, along with the individual performance of the players in a mixed-motive task, the literature has also stressed the importance of the relationship between the players in terms of rapport building (Nadler, 2004), favor exchange (Mell et al., 2015), and reputation effects (Zacharia and Maes, 2000).

2.2 Multi-Issue Bargaining Task

We now describe a common and useful abstraction for studying negotiations in scientific literature known as the Multi-Issue Bargaining Task or MIBT (Fershtman, 1990). This framework forms the backbone of several recently developed negotiation datasets and automated systems. Consider a multi-party negotiation between players 1, 2, . . . , n. Prior work has commonly considered bilateral negotiations, that is, a negotiation that involves two players (n = 2). MIBT involves a negotiation over m issues: I_1, I_2, . . . , I_m. These issues semantically refer to the different kinds of items that are up for grabs. Each issue I_i is associated with k_i levels, or the total number of available items of that issue. For instance, a negotiation with two issues might include a total of 5 books and 3 balls. Each player starts with some predefined preferences towards these issues, and these preferences are assumed to be fixed throughout. Then, the players interact with each other to reach an agreement about a plausible division of all the available items, such that each item is assigned to exactly one player. Aligned with their individual preferences, each player receives some points (or a reward) v_i for winning a single item of issue I_i as per the final agreed deal. Thus, the primary objective for a player is to maximize their total points p:

p = \sum_{i=1}^{m} v_i c_i,        (2.1)

where c_i \in \{0, 1, 2, . . . , k_i\} is the total number of items of issue I_i that the player receives in the final agreement.
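As a minimal illustration, Equation 2.1 can be computed as follows; the issue names, per-item values, and allocation are hypothetical.

```python
# A minimal sketch of the MIBT scoring in Equation 2.1, using hypothetical
# issue names, per-item values (v_i), and an example final allocation (c_i).

def total_points(values: dict, allocation: dict) -> int:
    """Return p = sum_i v_i * c_i for one player."""
    return sum(values[issue] * allocation.get(issue, 0) for issue in values)

# Example: 5 books and 3 balls are available; this player values books more.
values = {"books": 4, "balls": 1}        # v_i: points per item of each issue
allocation = {"books": 3, "balls": 1}    # c_i: items received in the final deal

print(total_points(values, allocation))  # 3*4 + 1*1 = 13
```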
We make several comments about this setup below.

Repeated Negotiations: Equation 2.1 assumes that there is only one round of negotiation between the players. However, realistic situations often demand repeated interactions between the same players (for instance, in business contracts). In such a case, the above equation is usually modified to capture the total points received across all rounds. This has been investigated extensively in the literature, given its interesting implications for the optimal strategies and the increased importance of maintaining the relationship between the players (Milnor and Shapley, 1957; Rosenthal and Rubinstein, 1984; Oliver et al., 1994; Mell et al., 2019). In Section 2.3, we will briefly discuss the design of our automated agent, which successfully engages in three back-to-back negotiations with the same human partner.

Subjective Measures of Performance: In addition to the total points scored, some researchers also incorporate several subjective measures of performance, usually gathered from the human players in a post-survey. For instance, this can include satisfaction with the outcome and the perception of their partners. In a dataset that we collected (Chawla et al., 2021b), we observed how it is possible for players to perform poorly in terms of the total points scored while still being highly satisfied with the outcome (discussed further in Chapter 3). Hence, capturing the subjective outcomes helps to make better sense of the final negotiation outcomes. In addition, as hinted above, these additional measures especially hold importance in the case of repeated interactions between the same players (Mell et al., 2019; Aydoğan et al., 2020).

No Explicit Access to Partner Preferences: Typically, the preferences of a player’s partner are not known to the player explicitly at the start of the negotiation. It is only through the interaction that the players understand each other’s objectives and try to reach an agreement. The task of inferring the partner’s hidden mental states or preferences is commonly referred to as Partner Modeling and will be called as such in this dissertation. This also relates to the concept of Theory-of-Mind in the literature (Frith and Frith, 2005; Ullman, 2023). As the reader might expect, prior work argues that a strong model of the partner is key for achieving successful outcomes in mixed-motive interactions, having implications on both objective and subjective measures of performance (Baarslag et al., 2013). It is also useful in providing automated pedagogical feedback to students of social skills training (Johnson et al., 2019a).

BATNA: Another important concept in negotiations that governs the behavior of the players and the perceived success of a negotiation is the Best Alternative to The Negotiated Agreement or BATNA (Fisher et al., 2011). BATNA essentially captures the best alternative in case no agreement is reached between the players in a negotiation. For instance, the datasets collected in the NLP community have used a fixed point value of 0 (Lewis et al., 2017) or the points v_{i*} corresponding to the highest priority issue I_{i*} for the player (Chawla et al., 2021b). In these cases, this means that the BATNA value is the same for all the players involved, but this may not be the case in realistic settings. BATNA is seen as a source of power in a negotiation, which, if used strategically, can lead to improved self and joint outcomes (Pinkley et al., 1994).
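A small sketch of how BATNA can enter a player's decision rule is given below: a proposed deal is accepted only if it beats the no-agreement alternative. The point values are hypothetical and merely illustrative.

```python
# Minimal sketch: accept a proposed deal only if its value exceeds the BATNA.
# Issue names, per-item values, and the BATNA value below are hypothetical.

def deal_value(values: dict, allocation: dict) -> int:
    return sum(values[i] * allocation.get(i, 0) for i in values)

def should_accept(values: dict, proposed_allocation: dict, batna: float) -> bool:
    """Accept only if the deal is worth more than walking away."""
    return deal_value(values, proposed_allocation) > batna

values = {"food": 5, "water": 4, "firewood": 3}
offer = {"food": 1, "water": 1, "firewood": 0}   # partner's proposed split
batna = 5.0                                       # e.g., value of one highest-priority item

print(should_accept(values, offer, batna))        # 5 + 4 = 9 > 5 -> True
```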
Integrative Potential: Recall that mixed-motive interactions capture the space between the two extremes of pure competition and pure collaboration. MIBT allows us to explore this space in a principled manner. Depending on how aligned the preferences of the players are, MIBT can result in a variety of negotiation scenarios. If the preferences completely match, this corresponds 14 to the zero-sum case, where the benefits of the players are inversely correlated. In such a case, the scenario is referred to as distributive, or having a low integrative potential. If, however, the preferences do not match, this paves the way for interesting solutions that benefit all the players involved. Such a case is said to have a high integrative potential. Given that it is common to assign player preferences randomly, MIBT-based tasks tend to cover the rich diversity in between these two extremes. To better measure the degree of collaboration between the two players, in addition to the individual points scored, researchers often also note the joint performance – the sum of the total points received by all the players involved. In other words, this measures the total joint value created in the negotiation (Chawla et al., 2023c). Practical Use: MIBT is capable of allowing sufficient flexibility to study a number of realistic negotiation behaviors. Later Chapters in this dissertation will provide extensive support for this argument. It has proved to be useful for defining many negotiation tasks in NLP (Lewis et al., 2017; Chawla et al., 2021b; Yamaguchi et al., 2021), beyond NLP (Mell and Gratch, 2017), and in the industry as well (such as iDecisionGames1 ). However, we would like to stress that it is merely an abstraction inspired by real-world interactions and the assumptions that the described setup makes (such as the player preferences remain fixed) may not always hold true. We will come back to these limitations later in Chapter 9. Automated Systems: To achieve success in MIBT, negotiators must understand their own preferences and excel at partner modeling while employing strategic reasoning and effective communication to come to an agreement. This makes designing automated agents for these tasks a challenging undertaking. The agents must wisely choose their strategy for sending offers and responding to incoming offers while also ensuring appropriate emotion expression. Consequently, 1https://idecisiongames.com/promo-home 15 to reduce the task complexity, early efforts in this direction relied on structured interfaces that provide a menu of options to receive inputs from human players. We will discuss these efforts in the next section. As a concrete example, we will also briefly describe our automated agent designed for such a platform. Later, we will discuss the need for more realistic means of communication, such as natural language, which is the focus of the rest of the manuscript. 2.3 Menu-Driven Interfaces Most of the prior research in automatic negotiation agents has focused on agent-agent interactions (Williams et al., 2012; Lin et al., 2014; Cao et al., 2018), targeting distributed problem-solving and higher computational efficiency. Thousands of offers can be quickly exchanged between two negotiating agents using a structured communication channel. Although agent-human negotiations are fundamentally different, agent-agent research has inspired some early efforts. 
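Before turning to concrete platforms, the notions of integrative potential and joint points discussed above can be made concrete with a small sketch; the preference values and allocations here are hypothetical and only illustrate the bookkeeping, not any particular system.

```python
# Illustrative sketch: joint points and a crude integrative-potential check
# for a two-player MIBT scenario. All preference values are hypothetical.

def points(values, allocation):
    return sum(values[i] * allocation.get(i, 0) for i in values)

def joint_points(values_a, values_b, alloc_a, alloc_b):
    return points(values_a, alloc_a) + points(values_b, alloc_b)

def preference_order(values):
    return sorted(values, key=values.get, reverse=True)

# Opposed priorities -> high integrative potential; identical -> distributive.
values_a = {"food": 5, "water": 4, "firewood": 3}
values_b = {"firewood": 5, "water": 4, "food": 3}

print(preference_order(values_a) == preference_order(values_b))  # False: not zero-sum

# Each player takes all three items of their own top issue, and they split water.
alloc_a = {"food": 3, "water": 2, "firewood": 0}
alloc_b = {"firewood": 3, "water": 1, "food": 0}
print(joint_points(values_a, values_b, alloc_a, alloc_b))  # 23 + 19 = 42
```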
To reduce the complexity of the communication interface, early efforts in agent-human negotiations relied on restrictive communication protocols involving a pre-defined menu of options. One example of such a design (based on MIBT) is the Interactive Arbitration Guide Online (IAGO) negotiation platform (Mell and Gratch, 2016). We provide a screenshot of the IAGO platform in Figure 2.1. In essence, IAGO provides the human players with a menu of options. When it is their turn, the player clicks buttons to select their next response for the agent. As visible in the screenshot, the human players can make full or partial offers (bottom left). They can also share their own preferences with the agent or ask the agent for its preferences (bottom right). Further, the interface allows the player to send text messages by selecting from a fixed predefined list of options. Finally, the player can express emotions (angry, sad, neutral, surprise, happy) by clicking on emoticon icons.

Figure 2.1: A screenshot of the IAGO negotiation platform. The interface provides a menu of options to receive inputs from the human players.

Advancing Mixed-Motive Research with IAGO: Sending in structured responses through such a menu of options with button clicks makes the design of the automated system much simpler – one can write rules that govern the complete behavior of the agent, conditioned on the incoming responses from the human player. This allows researchers to experiment with numerous theories developed in the mixed-motive literature (Section 2.1). For instance, IAGO supports anchoring, where agents start with highly unfair offers to pivot the rest of the interaction in their favor (Guthrie and Orr, 2006). The agents can be built with diverse concession curves and personalities. Competitive agents tend to start with selfish offers that can be unfair to their human partners but slowly concede more and more items as the interaction progresses. Consensus-building agents engage in the interaction to slowly build the deal with their partners. One can also design cooperative agents that aim for more fair agreements. Further, the agents can use framing via textual messages to impact the decision-making of their partners (Tversky and Kahneman, 1981). Other popular strategies grounded in prior research include appropriate emotion expression, partner modeling, and deception. IAGO breaks down these strategies into three policies: Behavior Policy (agent’s core strategy for sending and responding to offers), Message Policy (how the agent handles text messages), and Expression Policy (how the agent expresses emotions). Successful agents in IAGO typically make effective use of all three available policies. To give a concrete example, we now briefly describe our own agent designed for IAGO – Pilot: the winner of the Human-Agent Negotiation (HAN) League at the Automatic Negotiating Agents Competition (ANAC) co-located with IJCAI 2020 (Chawla and Lucas, 2020).

2.3.1 Pilot: Winner of the HAN Challenge at ANAC (IJCAI 2020)

The Pilot agent engages in a sequence of three back-to-back negotiations with a human partner. The negotiations are based on MIBT and are conducted over the IAGO platform. Various principles from the literature guide the behavior and the overall personality of our agent.

Agent Personality: Justifying the name, Pilot attempts to lead the negotiation with the human partner while also maintaining a friendly persona.
This allows Pilot to cater to its current objectives while maintaining a positive partner perception. To achieve this, our agent pushes the human partner to first share some of their preferences before discussing any offers. Previous results at ANAC show that humans tend to remain truthful and send out only a few offers themselves (Mell et al., 2018a). Hence, this push allows the agent to build a reasonable partner model before rolling out offers. Further, the agent uses phrases like “Let me help you out” to help the human in navigating the negotiation and as a result, portrays itself as experienced and builds a rapport with the human partner, which has been widely shown to build the joint value in negotiations (Nadler, 2004). Further, the agent does not lie about its preferences and does not hold back information when explicitly asked. However, it also does not give any out information, if not explicitly asked. Behavior Policy: The behavior policy defines the core negotiation strategy of the agent. Pilot follows a competitive negotiation strategy in all three negotiations, where it starts off with a high initial offer and keeps conceding the least wanted items one by one. Since the agent pushes its partner to share their preferences first, it starts off with a more reasonable initial offer, which has a better chance of getting accepted than an all-vs-none offer. However, this initial offer does become more favorable for the agent in later negotiations. In addition, inspired by agent Wotan (Mell et al., 2018a), Pilot only deals in full offers to save time. IAGO also allows players to exchange favors - a favor received in one negotiation can be paid back in the upcoming rounds. The baseline agent provided in IAGO recommends exchanging these favors early on in the negotiations. This has a couple of disadvantages: 19 1. First, whether the favor request is fruitful or not depends on the personality of the partner such as their Social Value Orientation (Van Lange et al., 1997; Cornelissen et al., 2009). Hence, this may not work with partners who portray competitive or selfish behavioral traits. In fact, based on prior ANAC competitions, favor exchange may even backfire with selfish partners (Mell et al., 2018b). 2. Secondly, even if the favor request is accepted, the agent only takes minimal advantage by claiming only a single valuable item. Ideally, to address the first concern, information about the personality traits of the partner can be useful. However, such information is not available to the agent. Hence, Pilot leverages prior research which suggests that appropriate framing can help promote pro-social behaviors (Pulford et al., 2016). We use sentences such as “I am excited to build the value for both of us. I hope you are equally excited as well.” and “Please remember: Our joint decisions will determine how many points we both earn.” to prime the pro-social behavior in the human partners, with the aim of increasing the likelihood of the favor request being accepted. In addition, to take the maximum advantage of favor acceptance, Pilot does not indulge in favor exchange early on, unlike the baseline agent. Instead, it holds off the request until the human partner has shared some of their preferences. If the favor is accepted, this allows our agent to roll out a full offer which greatly benefits the agent (addressing the second concern). 
While returning the favor, the agent starts with a reasonable initial full offer based on the discussed preferences but further concedes three additional items to the human partner. Message Policy: The message policy defines how the agent uses predefined text messages. IAGO provides a baseline message policy with predefined messages for a variety of common events that 20 can arise during the interaction. Pilot’s strategy mostly follows the provided baseline strategy with a few modifications: 1. We modified several message choices in the baseline agent, which we believe might have confused the human partner in the context of the IAGO platform. 2. If asked, the agent lies about its BATNA. Pilot inflates the original value of BATNA by 1.5 times. This allows the agent to negotiate from a position of higher bargaining power. Expression Policy: The expression policy defines how the agent exhibits emotions. These are displayed by the virtual human on the IAGO platform (see top left of Figure 2.1). The Pilot agent displays moderate expressions for the first two negotiations, promoting a healthy relationship with the partner. However, research has shown the benefits of extreme emotions such as anger in inducing more concessions (de Melo et al., 2011). Hence, for the third and final negotiation, the agent displays anger if an offer is rejected by the human player in the hope of getting a better deal. Through a combination of behavior, message, and expression policies, Pilot negotiates with a human partner in a sequence of three back-to-back negotiations via a highly structured negotiation interface. As quite evident from the description of the Pilot agent, this structure allows the designers of such agents to control the design at a very granular level. This makes such systems useful for Psychological and Affective Computing research (Aydoğan et al., 2020). However, this concreteness comes at a cost – it hinders the analysis and incorporation of several aspects of real-world negotiations, making such agents less useful in realistic settings. 21 2.3.2 Towards More Realistic Communication Channels Menu-driven interfaces lay the foundation for research in human-machine negotiations. Systems based on this design, such as Pilot, have been highly successful in enabling researchers to understand various aspects of human-machine mixed-motive interactions, such as effective strategies, the role of emotions, personality, and demographics. This body of work also forms the basis for understanding the extent to which prior research on negotiation principles (Lewicki et al., 2016) can be leveraged for building more sophisticated automated systems. Unfortunately, such structured communication is restrictive and only allows a limited expression of linguistic strategies and emotions. Further, the interface is unnatural for human players, introduces a cognitive load, and hence, distorts conclusions and limits practical usability. For instance, the human partners that interact with agents built on IAGO tend to roll out only a few offers, possibly due to the unnatural mode of communication. There is evidence that this effect can be mitigated when the human players are allowed to negotiate in free-form natural language (He et al., 2018; Chawla et al., 2021b). Therefore, it is highly desirable to incorporate additional modalities for communication that better capture the complexities of real-world negotiations, making the designed systems far more useful in downstream applications. 
To this end, the focus of this dissertation is to investigate human-machine mixed-motive interactions using natural language. This deviation from menu-driven communication means that instead of relying on rule-based techniques, we need a data-driven approach that better captures the intricacies of free-form dialogue. Instead of rule-based agents, we need to cast our agent as a dialogue system which can engage in natural conversations with human partners – a central area of research in the field of NLP. Hence, we now provide a brief overview of the NLP research on 22 dialogue systems, setting the stage for the work presented in later Chapters which builds on top of the advances in this area. 2.4 Research on Dialogue Systems Designing a dialogue system capable of engaging in natural conversations with human partners has been a long-standing goal in NLP and AI research. However, practically useful systems have been developed only recently, with breakthroughs in deep learning, compute resources, largescale language modeling, and reinforcement learning. Dialogue systems are designed for a wide variety of application domains, such as customer support, conversational AI assistants, or even chitchat companions. In addition to the conversation history, these models must often consider additional task context depending on the downstream use case, such as knowledge documents, persona statements, or even available API calls for assisting the users. Gao et al. (2018) provide a unified view of this research by formulating a dialogue system as a sequential decision-making process, where the agent chooses its next action (typically the next response) to optimize a suitable objective. Figure 2.2: Overview of a modular task-oriented dialogue system, taken from Gao et al. (2018). NLU: Natural Language Understanding, DM: Dialogue Manager, DST: Dialogue State Tracking, POL: Dialogue Policy, NLG: Natural Language Generation, DB: External Database. 23 Dialogue systems are primarily categorized into two groups: Task-Oriented and Open-Domain (Chen et al., 2017). Task-Oriented or Goal-oriented dialogue systems work towards completing a specific task or achieving a pre-defined, concrete goal (Zhang et al., 2020c). In their most trivial form, such conversations can be highly structured, where the dialogue system is only expected to indulge in conversations related to the task at hand, although recent work has stressed the importance of handling off-topic interactions for improved user experiences (Yan et al., 2017). Such conversations are mostly driven by the system itself, where it tries to request all the information that is necessary to fulfill the task. This provided structure in such settings can allow the researchers to model the state of the system in terms of intents and slots. For instance, say the user wants to reserve movie tickets (intent). Then, the system must gather the necessary information (in the form of slots and values) such as the number of tickets and the city. We provide an overview of a typical modular task-oriented dialogue system in Figure 2.2 that makes use of these intents and slots. More recent techniques have relied on language models for designing end-to-end systems (Su et al., 2022). To evaluate performance for task-oriented dialogue systems, the primary metric is, understandably, the rate of task completion, but other important metrics include the overall sensibility, coherency, and efficiency of the system. 
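As a small illustration of the intent/slot abstraction used by modular task-oriented systems, the movie-ticket example above could be tracked roughly as follows; the field and slot names are hypothetical and not tied to any specific framework.

```python
# Rough sketch of a dialogue state for the movie-ticket example. Field and
# slot names are hypothetical, illustrating the intent/slot abstraction only.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DialogueState:
    intent: str                                   # e.g., "book_movie_tickets"
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_slots(self):
        """Slots the system still needs to request from the user."""
        return [name for name, value in self.slots.items() if value is None]

state = DialogueState(
    intent="book_movie_tickets",
    slots={"movie": "Dune", "city": None, "num_tickets": "2", "showtime": None},
)
print(state.missing_slots())  # ['city', 'showtime']
```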
On the other hand, open-domain dialogue systems focus on open-ended conversations with the users. Rather than completing a specific pre-defined task, open-domain systems aim to engage the users in free-flowing interactions (Huang et al., 2020). The lack of a defined goal means that the system must be able to handle general conversations on a variety of topics. The task is usually formulated as a sequence-to-sequence learning problem, and the model is trained on humanhuman dialogue datasets in an end-to-end manner. More recently, language models trained on a humongous amount of conversational and non-conversational data and then instruction-tuned 24 with human feedback result in sophisticated systems for open-ended interactions (Xia et al., 2024). A central challenge in such open settings is the evaluation of the system since the objectives are themselves fuzzy. The subjectivity in human feedback (on what is considered a valid response) and its high variance across different age groups, ethnicities, and cultures makes evaluation extremely difficult. Researchers propose that various metrics that capture the quality of an individual utterance, as well as the conversation as a whole, should be looked at collectively (Mehri and Eskenazi, 2020). At a single utterance level, important metrics include interestingness, relevance, specificity, appropriateness, and fluency. At the level of the entire conversation, metrics such as diversity, coherency, depth, and consistency are considered. Advances in dialogue research in NLP inspired initial efforts for designing systems for mixedmotive interactions based on both modular (He et al., 2018) and end-to-end techniques (Lewis et al., 2017). However, as hinted earlier in Chapter 1, this traditional categorization of dialogue systems into task-oriented and open-domain is inadequate for capturing mixed-motive dialogues. In a typical negotiation, there is clearly a concrete notion of a goal for the system (reaching a favorable agreement with the partner). Hence, dialogue systems for these tasks clearly cannot be categorized as open-domain. Furthermore, the goal of the system goes beyond passively assisting the users in accomplishing their tasks. Instead, both the dialogue agent and the human player may have their own, often conflicting goals, necessitating active participation from both parties to build rapport and reach successful agreements. This fundamentally differs from the typical formulations used to describe task-oriented dialogue systems. This incompatibility between existing task definitions in the research community and the behaviors observed in mixed-motive interactions hinders systematic progress in this area, where efforts have usually been limited and scattered. Hence, later in Chapter 8, we will discuss our efforts in reshaping the existing dialogue research landscape 25 to better incorporate mixed-motive interactions. For now, however, just like how typical dialogue research begins with the collection of human-human dialogue datasets (Chen et al., 2017), the next Chapter describes our efforts in curating a novel dataset of realistic mixed-motive interactions that fueled our future investigations in this area. 26 Chapter 3 The CaSiNo Dataset Initial efforts in building dialogue systems for mixed-motive interactions looked at negotiation game environments (Asher et al., 2016; Lewis et al., 2017). DealOrNoDeal (Lewis et al., 2017) involves two negotiators who split given quantities of three arbitrary items: books, balls, and hats. 
This provides a concrete structure to the negotiation, keeps the design tractable, and ensures a reliable evaluation based on the final points scored. Many practical solutions in negotiations follow similar closed-domain designs based on MIBT (Section 2.2). However, most of the dialogues in these game settings reduce to merely an exchange of offers from both sides. For instance, ‘i need the book and the balls you can have the hat’ or ‘i want the ball and 2 books’ in DealOrNoDeal. One reason for this lack of richness in language use is that the items are arbitrarily defined – there is no semantic context around the items that the participants are negotiating for. Hence, this setup fails to capture many realistic aspects of negotiations, such as small talk, preference elicitation, emotion expression, and convincing strategies based on individual preferences and requirements. Emulating real-world negotiations is desirable for developing practical systems for social skills training and robust AI assistants that are useful in realistic scenarios. 27 On the other extreme, the CB dataset (He et al., 2018) involves buyer-seller negotiations to finalize the price of a given product. Targeting the collection of more open-ended dialogues, the participants are also encouraged to discuss side offers, such as free delivery or selling other accessories at the same price. Although this promotes diversity and rich natural conversations, unfortunately, such open-ended domains make the evaluation of negotiation performance nontrivial, which also inhibits the practical applicability of the systems developed on such datasets. For instance, in skills training, it is desirable to judge the performance and provide critical feedback (Monahan et al., 2018). To address these shortcomings, we design a novel negotiation task. Our design is based on MIBT, which provides a tractable closed-domain abstraction but is infused with a real-world camping scenario, resulting in rich dialogues for natural language research (Section 3.1). The task involves two participants who take the role of campsite neighbors and negotiate for additional Food, Water, and Firewood, based on individual preferences and requirements. Based on this design, we collect CaSiNo: a corpus of 1030 Camp Site Negotiation dialogues in English1 . The dialogues contain various aspects of a realistic negotiation, such as rapport building, discussing preferences, exchanging offers, emotion expression, and persuasion with personal and logical arguments. We also collect the participants’ satisfaction from the outcome and how much they like their partners, both being important metrics in negotiations (Mell et al., 2019). We annotate 9 persuasion strategies that span cooperative to selfish dialog behaviors (Section 3.2). We perform an extensive correlational analysis to investigate the relationship among the final outcomes and explore how they relate to the use of negotiation strategies (Section 3.3). Further, 1We publicly release the dataset to enable future work in this direction: https://huggingface.co/datasets/ casino. 28 we propose a multi-task framework with task-specific self-attention mechanisms to recognize these strategies in a given utterance (Section 3.4). 3.1 Data Collection Preferences & Arguments P1 P2 High: Water: We like to go on runs and it increases the need of this. High: Food: Food really increases everyones morale. Medium: Food: Food overall is a good mood booster. Medium: Firewood: We like to have a large fire. 
Low: Firewood: We do not care for fire and it is not necessary to us. Low: Water: We don’t drink water that often. Conversation Annotation P1: How are you today? Did you have any preferences on the supplies we will be trading? Small-Talk, Coordination, Elicit-Pref P2: I am good. How about yourself? I think I would like some firewood to start off with. We like to have bigger fires. What about you? Small-Talk, Self-Need, OtherNeed, Elicit-Pref P1: I am good as well. That is good to hear that you like to have bigger fires as we do not care much for that. We would much rather have some extra water. Small-Talk, Empathy, No-Need P2: Water is a little important to us too though , if possible maybe we can split that or maybe we can get some more food in replacement. Coordination P1: That may be possible.... What did you have in mind for the food replacement? Non-strategic P2: You can have all the water if we can have all the food? Non-strategic P1: I dont think I am okay with that . Food is essential to our groups morale when camping. We would like 1 additional food preferably. Self-Need, OtherNeed P2: Well you guys did say you did not care much about large fires. What if you gave all the firewood in replace for the water and you can still keep 1 food? UV-Part, Coordination P1: So I would get 3 water and 1 food and youd get 3 firewood and 2 food? Non-strategic P2: Yea that seems like an alright trade to me Non-strategic P1: Hmm... alright then Non-strategic P2: Submit-Deal P1: Accept-Deal Table 3.1: Sample dialogue from the CaSiNo dataset. P1 and P2 represent two participants in our data collection study. 29 Our data was crowd-sourced on Amazon Mechanical Turk. We describe our design by following the journey of a specific participant in our study. Pre-Survey: We start by collecting demographics and psychological personality traits of the participants that relate to their negotiation behaviors. For demographics, we gather age, gender, ethnicity, and the highest level of education. We consider two measures of individual personality differences: Social Value Orientation or SVO (Van Lange et al., 1997) and Big-5 personality (Goldberg, 1990) that have been heavily studied in the context of negotiations (Bogaert et al., 2008; Curtis et al., 2015). SVO classifies the participants as Prosocial, who tend to approach negotiations cooperatively, or Proself, who tend to be more individualistic. Big-5 personality test assesses the participants on five dimensions: Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experiences. Our participants exhibit diverse demography and psychological personality. We provide aggregate statistics in Appendix A.1. Negotiation Training: Research shows that the average human is bad at negotiating (Wunderle, 2007; Babcock and Laschever, 2009), which can adversely impact the quality of the collected dialogues and, consequently, the system trained on them. One way to mitigate this is by using reinforcement learning to optimize a reward that measures the negotiation performance. RL training has proved to be challenging and often leads to degeneracy (Lewis et al., 2017). Further, this ignores prior work in human-human negotiations that provides guidelines for achieving favorable outcomes in realistic negotiations (Lewicki et al., 2016). To incorporate these best practices in a principled way, we design a training module. Each participant is asked to watch a video tutorial before their negotiation. 
The tutorial takes an example of a negotiation between two art collectors to encourage them to follow some of the best practices in negotiations (Lewicki et al., 2016), including 1) Starting with high offers, 2) Discussing 30 preferences, 3) Appropriate emotion expression, and 4) Discussing individual requirements to make convincing arguments. This results in a rich and diverse set of dialogues, as we will explore later. We release the complete video tutorial publicly, with the hope that it promotes reproducibility and helps researchers to design similar data collection experiments in the future: https://youtu.be/7WLy8qjjMTY. Preparation Phase: Several requirements guide our design choices: 1) Semantically Meaningful: The context must be meaningful and relatable for MTurk participants and for anyone who negotiates with the system trained on this dataset. This allows the participants to indulge in personal and contextual conversations, making the resulting system more useful for downstream applications. 2) Symmetric task: The task should be symmetric for both the participants so that a dialogue system may leverage both sides of the conversations during modeling, and 3) Symmetric items: The items which the participants are negotiating for should be symmetric in the sense that an individual can resonate with any preference order assigned to them. Hence, every category of items can be more desirable over others depending on a real-world context. Our scenario is an instance of MIBT (Section 2.2). The task involves campsite neighbors who negotiate for additional Food, Water, and Firewood packages, each with a total quantity of three. Instead of choosing an arbitrary set of items, each item represents quite relatable, basic requirements that one might plausibly have for an actual camping trip. The items were only broadly defined to encourage diversity. One challenge when dealing with a realistic context like camping is the inherent bias that one might have towards one item over others, which violates our symmetry constraint. To mitigate this, we emphasize that the camping authorities have already provided the basic essentials, and the participants will be negotiating for extras based on their 31 individual plans for camping. We present the negotiation scenario, as seen by participants, in Appendix A.2. The three item types are assigned a random priority order for every participant using a permutation of {High, Medium, Low}. As in realistic negotiations, the participants are asked to prepare for their negotiation by coming up with justifications for the given preferences before the negotiation begins (precise question format in Appendix A.7), for instance, needing more water supplies for a hike or firewood for a bonfire with friends. We find that the participants are able to come up with a variety of arguments from their own camping experiences, such as Personal Care, Recreational, Group Needs or Emergency requirements. We illustrate some of these arguments in Appendix A.2. The participants were encouraged to use their justifications as they felt fit to negotiate for a more favorable deal. Negotiation Dialogue: Finally, two participants are randomly paired to engage in an alternating dialogue for a minimum total of 10 utterances. We also provide the option to use emoticons for four basic emotions, namely, happy, sad, anger, and surprise. After coming to an agreement, the participants submit the deal formally using the provided options. 
They can also walk away from the negotiation if they are unable to come to an agreement. The primary evaluation metric to assess the negotiation performance is the number of points scored by a negotiator. Every High, Medium, and Low priority item is worth 5, 4, and 3 points respectively, such that a participant can earn a maximum of 36 points if she is able to get all the available items. Post-Survey: We collect two other evaluation metrics relevant to negotiations: 1) 5-point scale for satisfaction (How satisfied are you with the negotiation outcome?) and 2) 5-point scale for partner likeness (How much do you like your partner?). Back-to-back negotiation (Aydoğan et al., 2020) is an interesting case where the relationship with the partner is crucial. In such a case, a poor 32 relationship in earlier negotiations can adversely impact the performance in later rounds. Further, for some cases in CaSiNo, we observed that the participants were satisfied with their performance despite performing poorly because they thought that the arguments of their partners for claiming the items were justified. One might argue that this is still a successful negotiation. Hence, we believe that all the metrics defined here are important in the context of real-world negotiations and propose that they should be looked at collectively. We will further analyze these outcome variables in Section 3.3 where we study the correlations between the participants’ negotiation behaviors and these metrics of negotiation performance. Data Collection: We collected the dataset over a month using the ParlAI framework (Miller et al., 2017). Screenshots from the interface are provided in Appendix A.7. The participant pool was restricted to the United States, with a minimum of 500 assignments approved and at least a 95% approval rate. We post-process the data to address poor-quality dialogues and inappropriate language use. We describe these post-processing steps in Appendix A.3. Finally, we ended up with 1030 negotiation dialogues between 846 unique participants. On average, a dialogue consists of 11.6 utterances with 22 tokens per utterance. We present a sample dialogue with the associated participant profile in Table 3.1. The participants are rewarded a base amount of $2 for their time (around 20 minutes). Further, they were incentivized with a performance-based bonus of 8.33 cents for every point that they were able to negotiate for. If a participant walks away, both parties get the amount corresponding to one high item or the equivalent of 5 points. The bonus is paid out immediately after the task to encourage participation. Overall, the participants had highly positive feedback for our task and could relate well to the camping scenario, engaging in enjoyable, interesting, and rich personal conversations. We discuss their feedback with examples in Appendix A.4. 33 3.2 Strategy Annotations Label Example Count Prosocial Generic Small-Talk Hello, how are you today? 1054 0.81 Empathy Oh I wouldn’t want for you to freeze 254 0.42 Coordination Let’s try to make a deal that benefits us both! 579 0.42 Prosocial About Preferences No-Need We have plenty of water to spare. 196 0.77 Elicit-Pref What supplies do you prefer to take the most of? 377 0.77 Proself Generic Undervalue-Partner Do you have help carrying all that extra firewood? Could be heavy? 131 0.72 Vouch-Fairness That would leave me with no water. 439 0.62 Proself About Preferences Self-Need I can’t take cold and would badly need to have more firewood. 
964 0.75 Other-Need we got kids on this trip, they need food too. 409 0.89 Non-strategic Hello, I need supplies for the trip! 1455 - Table 3.2: Utterance-level strategy annotations. refers to Krippendorff’s alpha among 3 annotators on a subset of 10 dialogues (∼ 120 utterances). An utterance can have multiple labels. After collecting the dataset, we developed an annotation schema to analyze the negotiation strategies used by the participants and to facilitate future work. We follow the conceptual content analysis procedure (Krippendorff, 2004) to design the scheme. Being a natural conversational dataset, we find several instances where a strategy spans multiple sentences in an utterance, as well as instances where the same sentence contains several strategies. Hence, we define an utterance as the level of analysis. Each utterance is annotated with one or more labels. If no strategy is evident, the utterance is labeled as Non-strategic. Although we label entire utterances, self-attention shows some promise as an automatic way to identify which part of an utterance corresponds to a given strategy, if desirable for a downstream application (Section 3.4). 34 As discussed in Chapter 2, mixed-motive tasks involve a variety of strategic behaviors. Broadly, human negotiation behaviors can be categorized as Prosocial, which promote the interests of others or the common good, and Proself, which tend to promote self-interest in the negotiations (Van Lange et al., 2007; Yamagishi et al., 2017). Another important criterion is discussing preferences. Prior work suggests that humans negotiate with a fixed-pie bias, assuming that the partner’s preferences align, and hence achieving sub-optimal solutions (Kelley, 1996). Based on these distinctions and manual inspection, we define 9 strategies used in the CaSiNo dataset. The usage of these negotiation strategies correlates with both the objective and subjective metrics of negotiation performance. 3.2.1 Prosocial Strategies Prosocial strategies address the concerns of both the self and the negotiation partner. We define three strategies that exhibit generic Prosocial behavior. Small-Talk: Participants engage in small talk while discussing topics apart from the negotiation in an attempt to build a rapport with the partner. For example, discussing how the partner is doing during the pandemic or sharing excitement for the camping trip. Rapport has been well studied to positively impact negotiation outcomes (Nadler, 2004). Small talk usually appears either at the beginning or at the end of the negotiation. Empathy: An utterance depicts Empathy when there is evidence of positive acknowledgments or empathetic behavior towards a personal context of the partner, for instance, towards a medical emergency. Empathy promotes Prosocial behaviors in interpersonal interactions (Klimecki, 2019). 35 Coordination: This strategy is used when a participant promotes coordination among the two partners. This can be, for instance, through an explicit offer of a trade or mutual concession or via an implicit remark suggesting to work together towards a deal. Further, we define two strategies that relate to Prosocial behavior about individual preferences: No-Need: This appears when a participant points out that they do not need an item based on personal context, such as suggesting that they have ample water to spare. No-Need can directly benefit the partner since it implies that the item is up for grabs. Elicit-Pref: This is an attempt to discover the preference order of the partner. 
CaSiNo covers a range of scenarios based on how aligned the preferences of the two parties are. Generally, we find that discussing preferences upfront leads to smoother negotiations without much back and forth. 3.2.2 Proself Strategies Proself behavior attempts to serve personal performance in a negotiation. We define two strategies exhibiting generic Proself behavior. Undervalue-Partner: Also referred as UV-Part, this strategy refers to the scenario where a participant undermines the requirements of their partner, for instance, suggesting that the partner would not need more firewood since they already have the basic supplies or a suggestion that there might be a store near the campsite where the partner can get the supplies instead. Vouch-Fairness: This is a callout to fairness for personal benefit, either when acknowledging a fair deal or when the partner offers a deal that benefits them. For instance, through an explicit callout ‘this deal is not fair’, or implicitly saying ‘this does not leave me with anything’. 36 Finally, we consider two Proself strategies that relate to individual preferences: Self-Need: Here, the participant uses arguments to create a personal need for an item in the negotiation. For instance, a participant pointing out that they sweat a lot to show preference towards water packages. Other-Need: This strategy is similar to Self-Need but is used when the participants discuss a need for someone else rather than themselves. For instance, describing the need for firewood to keep the kids warm. Negotiating on behalf of others is densely studied as a competitive strategy, where negotiators engage in contentious, demanding, and inflexible bargaining behaviors (Adams, 1976; Clopton, 1984). Collecting annotations: Three expert annotators2 independently annotated 396 dialogues containing 4615 utterances. The annotation guidelines were iterated over a subset of 5 dialogues, while the reliability scores were computed on a different subset of 10 dialogues. We use the nominal form of Krippendorff’s alpha (Krippendorff, 2018) to measure the inter-annotator agreement. We provide the annotation statistics in Table 3.2. Although we release all the annotations, we skip Coordination and Empathy for our analysis in this work due to higher subjectivity resulting in relatively lower reliability scores. In the following sections, we will refer to this annotated subset of CaSiNo as CaSiNo-Ann. 3.3 Correlational Analysis We next perform correlational analysis on CaSiNo-Ann to understand how the points scored by a participant relate to their satisfaction with the outcome and their partner’s perception. We 2Researchers involved in the project. 37 further shed light on what kind of strategies are more likely to lead to better outcomes. Such insights motivate our experiments on strategy prediction and would direct future efforts in building negotiation systems. We present complete results in Appendix A.5 and discuss the significant observations below. Relationship among outcome variables: We consider the points scored, satisfaction from the outcome, and partner likeness. We find that the points scored by a participant are positively correlated with their own satisfaction (=0.376, < 0.01) and with their perception of the partner (=0.276, < 0.01). Similar trends are visible with the corresponding variables of the negotiation partner as well, suggesting that the participants secured more points while still maintaining a positive perception in the eyes of their partners. 
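As a rough sketch of how such correlations can be computed, assume a hypothetical per-participant table with the outcome variables; the numbers below are made up purely to show the mechanics, not results from CaSiNo-Ann.

```python
# Minimal sketch of the correlational analysis on hypothetical data:
# Pearson correlation between points scored and satisfaction ratings.

from scipy.stats import pearsonr

points       = [18, 21, 15, 24, 20, 17]   # made-up per-participant points
satisfaction = [ 4,  5,  3,  5,  4,  3]   # made-up 5-point satisfaction ratings

r, p = pearsonr(points, satisfaction)
print(f"r = {r:.3f}, p = {p:.3f}")
```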
Discovering the integrative potential: Integrative potential in a negotiation is based on how aligned the partner preferences are. Complete alignment leads to a distributive (or zero-sum) negotiation, having a low integrative potential where the benefit of one results in a high loss for the other. A negotiation is integrative if the preferences do not align, allowing for solutions that maximize mutual points. We assign each dialogue either 1, 2, or 3, depending on whether the integrative potential is low, medium, or high. The maximum joint points possible in these cases are 36, 39, and 42, respectively. We find that the participants are able to discover this integrativeness, thereby achieving significantly more joint points as the potential increases ( = 0.425, < 0.001). Use of negotiation strategies: Overall, we find that greater use of Prosocial strategies shows a general pattern to predict higher ratings for both subjective measures of satisfaction and likeness, for self as well as the partner. Engaging in small talk shows significant positive correlations ( < 0.01), confirming our hypothesis from prior work that it relates to healthier relationships among the negotiators. Similar effects are visible for No-Need ( < 0.05), where the participant 38 decides to let go of one of their low-priority items. Since this directly benefits the partner, it is likely to improve the participant’s perception. On the other hand, Proself strategies show a general pattern to predict lower satisfaction and likeness ratings for both the self and the partner. We observe a significant negative correlation for both Other-Need and Vouch-Fair ( < 0.01). Further, we find that these competitive strategies are also associated with lower points scored by the participant and the partner, and hence, the joint points ( < 0.01). These correlations are not influenced by the integrative potential in the scenario, as when the integrated potential is controlled for, the effects generally remain unchanged and demonstrate the same patterns. We further observe that the dialogue behavior of a negotiator significantly relates to the behavior of their partner, where both tend to use similar negotiation strategies ( < 0.01). Our findings show that Prosocial strategies are more likely to be associated with Prosocial behavior in the partners and achieve more favorable outcomes in our negotiation scenario as compared to Proself. These results suggest that an automated negotiator can benefit by employing different strategies based on the Prosocial or Proself behaviors of the partner, for instance, by matching Prosocial behaviors but not Proself. The first step in this direction is to recognize them in a given utterance, which is the focus of the next section. 3.4 Strategy Prediction For building an automated dialogue system that incorporates the negotiation strategies discussed above, an important first step is to build computational models that recognize their usage in the observed utterances. Hence, we explore the task of strategy prediction, given an utterance and its previous dialogue context. 39 3.4.1 Methodology Figure 3.1: Architecture for multi-task strategy prediction. + represents element-wise summation. Pre-trained models have proved to be useful on a number of supervised tasks with limited in-domain datasets. Inspired by this success, we use BERT-base (Devlin et al., 2019) as the core encoding module. 
A natural way to use pre-trained models for our task is to fine-tune the model for every label independently in a binary classification setup, where the positive class represents the presence of a strategy, and the negative represents its absence. However, most of the utterances in the CaSiNo-Ann dataset are Non-strategic, resulting in a high imbalance where most of the data points belong to the negative class. As we later show, directly fine-tuning the BERT model fails to recognize the strategies for which the data is most skewed. 40 We instead propose a multi-task learning framework to allow parameter sharing between the different prediction tasks. Our architecture involves a common BERT-base encoder shared with all the tasks but uses task-specific self-attention to allow the model to focus on the most relevant parts of the input for each task separately. Consequently, this also enables interpretability by allowing us to visualize which parts of an utterance are attended for any given strategy. Our input consists of a finite-size context window, which loses the turn index for a specific utterance. Hence, we also capture the turn position for each utterance using sinusoidal positional embeddings (Vaswani et al., 2017). We present the complete architecture in Figure 3.1. In-Domain Pre-Training (IDPT): CaSiNo-Ann is nearly 40% of the entire CaSiNo dataset. To incorporate the unannotated dialogues, we employ In-Domain Pre-training of the BERT encoder (Sun et al., 2019). For this purpose, we consider each unannotated dialogue as a separate sequence and fine-tune the BERT-base architecture on the Masked Language Modelling (MLM) objective (Devlin et al., 2019). This allows us to use the complete CaSiNo dataset in a principled way. 3.4.2 Experimental Design Evaluation Metrics: We compare our methods for each strategy label on F1-score for positive class (presence of strategy label). To capture the overall performance, we report average F1 across all labels with uniform weights. Inspired by Joint Goal Accuracy from Dialog State Tracking (Kumar et al., 2020), we define another overall metric called Joint-A, which measures the percentage of utterances for which the model predicts all the strategies correctly. 41 Methods: Fine-tuning the pre-trained models has achieved state-of-the-art results across many supervised tasks. Hence, our primary baseline is BERT-FT, which fine-tunes the BERT-base architecture for binary classification of each strategy label separately. We consider a Majority baseline, where the model directly outputs the majority class in the training data. We also implement a Logistic Regression model for each label separately based on a bag-of-words feature representation of the input utterance. We refer to this model as LR-BoW. We refer to our complete architecture presented in Figure 3.1 as Full and consider its ablations by freezing the BERT layer (Freeze), removing task-specific self-attention (No Attn), or removing the turn position embeddings (No Feats). We also implement a simple over-sampling strategy where every utterance with at least one strategy is considered twice while training (referred to as OS). For IDPT, we fine-tune BERT for 20 epochs using a masking probability of 0.3. We also tried a lower masking probability of 0.15, however, in that case, the model is unable to learn anything useful on our relatively small dataset. Training Details: Our context window considers past 3 utterances and concatenates them using an EOS token. 
The embedding dimension is 768 for the encoder and the task-specific self-attention layers, each having only one attention head. We use the turn position embeddings of 32 dimensions. We train the models with Adam optimizer with a learning rate of 5 −05 and weight decay of 0.01. We use ReLU activation for feed-forward layers and a dropout of 0.1 to prevent overfitting. The models were trained for a maximum of 720 iterations with a batch size of 64 (∼ 13 epochs). We checkpoint and evaluate the model after every 72 iterations and the best-performing checkpoint on a held-out 5% validation set is used for evaluation. We provide further training details, including specifics of the architecture design, computing infrastructure, and hyper-parameter tuning in Appendix A.6. 42 3.4.3 Results Table 3.3 summarizes the results on 5-fold cross-validation. Majority baseline fails to recognize any of the strategies due to the data being skewed towards the negative class. It still achieves 39.4% Joint-A, indicating that these many utterances have none of the seven strategies present. Incorporating the bag-of-words features, LR-BoW performs much better than Majority. BERTFT highly improves the performance on all strategies except No-Need and UV-Part, for which the dataset is the most skewed. However, our Full multi-tasking framework is able to tackle the imbalance in these strategies through parameter sharing between all tasks. It achieves 36.4% F1 for No-Need and 44.5% F1 for UV-Part, indicating more than 100% improvements in both the cases. The model also improves F1 scores for all other metrics, but the improvement is not that substantial. Relatively lower scores for Freeze and No Attn suggest that both fine-tuning and task-specific attention layers are essential for performance. Turn position embeddings, however, only help for a few strategies, indicating the diverse usage of strategies in CaSiNo-Ann. Overall, we find that using over-sampling and in-domain pre-training further helps the performance, especially for No-Need and UV-Part. Although there is no clear winner among OS and IDPT, our final model, Full+IDPT+OS, that combines both these strategies performs the best for us, achieving an overall F1 score of 68.3% and 70.2% Joint Accuracy. Attention Visualization: To understand if the model learns meaningful representations, we visualize the task-specific self-attention layers of the trained Full+IDPT+OS model. We consider two instances in Figure 3.2. For meaningful comparisons, the instances were picked randomly from the pool of all utterances that contain two strategies. As evident, the model is able to focus on the most relevant parts for each strategy label. For instance, in the case of Other-Need, the 43 Model Small-Talk Self-Need Other-Need No-Need Elicit-Pref UV-Part Vouch-Fair Overall F1 F1 F1 F1 F1 F1 F1 F1 Joint-A Majority 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 39.6 LR-BoW 64.6 57.2 43.2 17.5 56.5 14.3 50.4 43.4 52.4 BERT-FT 81.6 72.3 76.7 16.4 80.5 20.4 61.9 58.5 64.0 Multi-task training Freeze 81.0 69.1 69.5 14.8 77.6 9.2 66.3 55.4 65.8 No Attn 80.7 71.9 76.8 7.5 79.0 23.2 60.6 57.1 67.8 No Feats 82.7 75.1 78.8 37.8 82.4 46.2 66.8 67.1 69.9 Full 82.7 74.4 77.9 36.4 83.2 44.5 67.9 66.7 70.2 +OS 82.0 77.1 75.6 44.2 81.9 46.4 67.3 67.8 70.1 +IDPT 82.6 74.0 80.4 41.2 82.8 40.8 64.0 66.6 69.5 +IDPT+OS 82.6 75.2 78.8 46.2 81.8 47.3 66.1 68.3 70.2 Table 3.3: Performance on strategy prediction task for 5-fold cross validation. F1 score corresponds to the positive class. 
Attention Visualization: To understand whether the model learns meaningful representations, we visualize the task-specific self-attention layers of the trained Full+IDPT+OS model. We consider two instances in Figure 3.2. For meaningful comparisons, the instances were picked randomly from the pool of all utterances that contain two strategies.

Figure 3.2: Visualizing task-specific self-attention layers for two examples from the test dataset for the first CV fold. The heatmap shows the attention scores for each token in the utterance for the corresponding strategy labels.

As evident, the model is able to focus on the most relevant parts of the input for each strategy label. For instance, in the case of Other-Need, the attention scores are higher when the participant talks about their kids needing more food. The token ‘we’ gets the most attention, which is commonly used by the participants when referring to group needs. We see similar trends in the second case as well. Remarkably, this suggests that although our annotations are at the utterance level, it might be possible to automatically retrieve the most relevant phrases for any given strategy. This requires further investigation and should be explored in future work.

3.5 Conclusion

In this Chapter, we described the design and development of the CaSiNo dataset. In contrast to menu-driven interfaces that rely on restrictive communication based on button clicks (Section 2.3), our task allows free-form interaction between the negotiators based on natural language. The task focuses not just on ‘what the players prefer in the negotiation’ but also on ‘why the players have these preferences’, eliciting motivations from their personal experiences in an easily relatable mixed-motive scenario. As our analysis shows, this leads to a collection of rich dialogues in which players employ a variety of arguments as well as strategic behaviors ranging from selfish to socialistic, while still interacting in a closed-domain MIBT environment. In the next Chapter, we continue investigating the realistic behaviors observed in CaSiNo by analyzing the emotions expressed and their utility in making predictions about the final negotiation outcomes.

Chapter 4
A Case for Emotion-Aware Mixed-Motive Systems

Negotiation is a core task for studying emotional feelings and expressions in human decision-making (Gratch et al., 2015). Being a mixed-motive task, it creates both interpersonal and intrapersonal conflicts for the negotiators. Motivational tensions often arise when negotiations pit aspirations for individual accomplishments against the demands of sustaining social connections. This leaves negotiators with difficult decisions about working towards their own self-oriented outcomes or making sacrifices for others. Such situations can be fraught with emotional encounters. For instance, as a negotiator strives to get as much as possible for themselves, they need their partner to go along as well. However, the partner's willingness to go along is essentially an emotional decision. A negotiator who tries to take too much can annoy their partner and, in turn, hurt their likeability in the eyes of the partner as well as the partner's affective evaluation of the outcome (that is, their satisfaction). Instead, it is desirable for the negotiator to strive for maximum performance while ensuring that the partner is satisfied (Oliver et al., 1994) and leaves with a positive perception of the partner (Mell et al., 2019; Aydoğan et al., 2020). Maintaining a positive relationship with the partner is especially crucial in repeated interactions, where poor relations in earlier negotiations can adversely impact the results of future ones (Aydoğan et al., 2020). Therefore, predicting the partner's satisfaction and perception in advance can be crucial for an AI assistant that aims to negotiate with its users. A number of prior studies rely on user individual difference attributes to explain negotiation behaviors and outcomes.
For instance, this includes studies based on gender (Stuhlmacher and Walters, 1999) and personality traits such as Social Value Orientation and Machiavellianism (Xu et al., 2017). This research is crucial from the perspective of practical negotiation agents that interact with users on social media platforms since research suggests that demographic and personality attributes can be inferred from past user interactions (Ortigosa et al., 2011; Dong et al., 2014; Adalı and Golbeck, 2014). Thus, they can be used to adapt the behavior of the deployed automated systems. However, relying only on these variables misses out on all the other available information such as affective attributes in the negotiation itself, which may further help in predicting the outcomes. While one might expect that such affect variables would help merely because they are manifestations of individual differences, it is instead possible that affective factors reflect the recent interaction with the partner and might predict in their own right. This leads us to the key research question we aim to answer in this Chapter: For predicting a negotiator’s satisfaction and liking for their partners, is there value in leveraging affective attributes from the negotiation, above and beyond the individual difference variables? As discussed in Chapter 2, due to the inherent design restrictions in menu-driven interfaces, emotion expression merely reduces to using emoticons and predefined textual messages. Instead, we perform our analysis on the CaSiNo dataset discussed in the previous Chapter. The associated metadata in CaSiNo about the participants enables us to study individual differences and linguistic dialogue behaviors in the same setup. Going beyond what is allowed by menu-driven communication, we develop three degrees of emotion recognition techniques: emoticons, lexical, and 47 contextual, by leveraging recent advancements in deep learning (Section 4.1). We draw extensive qualitative and quantitative comparisons between these different methods. Using correlational and step-wise regression analysis, we quantify the extent to which affective attributes help to account for the variance in the outcomes, beyond just the individual user attributes (Section 4.3). We further validate our results by controlling for the objective performance of the participants in the negotiation and study the temporal effects of emotion expression on the final prediction performance. To understand the generality of our findings, we extend our analysis to a more complex dataset consisting of transcripts from face-to-face negotiations (Section 4.4). We conclude this Chapter by discussing how our analysis can guide the development of automated systems (Section 4.5). 4.1 Measures For our analysis in this Chapter, we define three clusters of variables: 1) Individual differences, which captures the demographics and personality of the participants, 2) Affect Variables, which comprises all the affective attributes extracted from the negotiation utterances, and 3) Negotiation Outcomes, where we describe the two primary dependent variables in our analysis. 4.1.1 Individual Differences These variables are based on the responses reported by the participants in a survey before their negotiation. The participants self-identified their demographic attributes while the personality traits are based on the standard tests from the psychology literature. 48 Demographics: There are two continuous variables: Age1 and Education. 
We encoded Education as a continuous variable, leveraging the inherent order in the highest level of education. It takes a value from 0 to 8, with an increasing level of education. Further, there are two discrete demographic variables: Gender and Ethnicity. For our regression analysis, we dummy-coded these variables based on the categories discussed earlier. Personality: There are two available measures of individual personality differences: Social Value Orientation or SVO (Van Lange et al., 1997) and the Big-5 personality traits (Goldberg, 1990). SVO is a discrete variable – A participant can either be categorized as Prosocial, who tend to be cooperative in their interactions, or Proself, who tend to serve their individual interests. In the CaSiNo data collection study, SVO was computed using the Triple Dominance Measure (Van Lange et al., 1997). 463 participants were classified as Prosocial, 364 as Proself, while 19 were Unclassified. The Big-5 test consists of five personality dimensions: Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experiences. These dimensions were computed using the Ten-Item Personality Inventory (Ehrhart et al., 2009). Each dimension takes a value between 1 and 7. 4.1.2 Capturing Affect Natural language based negotiations provide exciting avenues for Affective Computing research. Unlike menu-driven systems, where communication is highly restricted, language allows free expression of emotion and other affective attributes. In order to capture these attributes in CaSiNo, we consider three different degrees of affect recognition techniques: emoticon counts, use of 1One participant reported the age of 3, which we believed to be in error and was removed from all our analysis that uses Age. 49 emotive vocabulary based on affect lexicons (Pennebaker et al., 2001), and utterance-level emotion expressions based on a pretrained deep learning model called T5 (Raffel et al., 2020). We discuss these approaches below. We will later compare these methods through our regression analysis for predicting satisfaction and liking. Emoticons: Emoticons provide a structured way for emotion expression. Due to this reason, emoticons have been heavily used in menu-driven negotiation research. For instance, the IAGO platform for human-agent negotiations allows human participants to use emoticons via button clicks (Mell and Gratch, 2016). For CaSiNo, the participants were allowed to use shorthands for typing emoticons directly in the chat interface for four basic emotions: Joy, Sadness, Anger, and Surprise (see screenshots in Appendix A.7). For a given participant and the negotiation dialogue, we count the number of emoticons used by the participant for each of these types and use that as a continuous measure of emotion expression. Approximately 15% of the utterances in CaSiNo make use of one or more emoticons. Among those utterances that use emoticons, nearly 80% use the Joy emoticon. Based on a manual inspection of the utterances, we find that the participants tend to use Joy emoticon in a number of scenarios, including small talk at the beginning or end of the conversation when showing agreement, and in some cases, interestingly, also when rolling out a strict offer. Sadness tends to be used when showing disagreement, showing empathy, or emphasizing personal needs. Some cases where the participants expressed Surprise are when pointing out strange behavior from their partner or showing empathy towards a specific need of their partner. 
Anger is used in cases of strong disagreement where for instance, the partner does not empathize with the personal requirements of the participants or when they receive a very unfair offer that largely benefits the negotiation partner. 50 LIWC: We now go beyond the capabilities of menu-driven systems by extracting emotion attributes from the text in the utterances of the participants. Specifically, we look at word-level expressions by leveraging affect lexicons from the literature. We make use of the Linguistic Inquiry and Word Count lexicon, popularly referred to as LIWC in the literature (Pennebaker et al., 2001). LIWC consists of a number of word vocabularies corresponding to everyday use that reflect our feelings and personality. Given our focus on affective attributes, we use four LIWC dimensions in this work: Positive Emotions, Sadness, Anger, and Anxiety. To compute the corresponding scores for a participant in a given negotiation dialogue, we simply count the number of words in the utterances of the participant that are present in the corresponding LIWC lexicons. As one might expect, we find a number of instances where the participants express emotions without explicitly making use of emoticons. For instance, an excerpt where the participant clearly expresses anger without using the anger emoticon: ‘Are you mad. with out water what we do...’. Hence, with LIWC lexicons, we are able to capture more emotive utterances than by simply looking at the emoticon usage. Nearly 60% of the utterances make use of at least one emotive word from our four LIWC categories. Since LIWC variables are based on word usage, they provide a highly interpretable way of extracting emotions directly from the text. However, merely relying on word-level traits can still miss out on phrase-level or utterance-level emotion expressions. Hence, we next discuss how we make use of a state-of-the-art deep learning model from Natural Language Processing (NLP) to identify emotion in CaSiNo dialogues. Utterance-level emotion: Large deep learning models based on the highly parallelizable Transformer architecture have gained enormous popularity in the field of NLP in recent years (Vaswani et al., 2017). These models have achieved huge success on a variety of tasks ranging from sentiment 51 analysis to machine translation, fundamentally changing the way researchers approach these tasks. The primary way of training a model for a specific task consists of two steps: 1) Pre-train a large language model with freely available text data on the web in an unsupervised or a self-supervised manner, and then 2) Fine-tune the model on the task-specific dataset based on supervised learning. Leveraging these recent advancements, we make use of the Text-to-Text Transfer Transformer (T5) model (Raffel et al., 2020). T5 is an encoder-decoder architecture that is pretrained on a variety of NLP tasks by converting all the tasks in the same text-to-text format. No matter the task, the input to the model is a text sequence along with a task identifier prefix, and the target output is also a text sequence. Classification tasks can also be converted in this way by simply using a target output sequence that consists of a single word corresponding to the ground-truth label. This approach swiftly unifies a number of tasks, paving the way for parameter sharing throughout the model architecture. T5 achieves state-of-the-art performance on a number of NLP tasks, such as classification, text summarization, and question answering. 
For our purpose, we experiment with two variants of the T5 model, fine-tuned on two different emotion classification datasets. The first dataset comes from the annotated social interactions on Twitter (Saravia et al., 2018). The dataset categorizes each English input sentence into six emotion classes: Joy, Love, Sadness, Fear, Anger, and Surprise. It contains 16k training, 2k validation, and 2k test instances. Instead of fine-tuning the T5 model ourselves, we directly use the publicly available model checkpoint in the Huggingface library2 . The model performs well on the test set of this data, achieving an accuracy of 93% and a macro F1 score of 90%. We refer to this model as T5-Twitter in the rest of the discussion. 2https://huggingface.co/mrm8488/t5-base-finetuned-emotion 52 We use the GoEmotions dataset (Demszky et al., 2020) for the second model variant. GoEmotions is a much larger data source that consists of 58k English Reddit comments, each annotated with one or more labels from a set of 28 granular emotion classes. Since we could not find an already available T5 model trained on this dataset, we performed this fine-tuning ourselves. To pre-process the data, we first map the emotion labels to 6 Ekman dimensions (Joy, Disgust, Sadness, Fear, Anger, and Surprise), using the mapping released by the authors. Next, for uniformity with T5-Twitter and for simplicity while training the T5 model, we filter out the instances labeled as Neutral and further remove the small proportion of instances labeled with more than one emotion class. Finally, after a random split, this leads to 26758 instances for training, 3345 for validation, and 3345 for testing, which is nearly 67% larger than the Twitter data described above. To fine-tune the T5 model, we keep the same hyperparameters as the T5-Twitter model for fair comparisons. The model is trained for 2 epochs using a batch size of 128. We use a learning rate of 0.0003 with the standard AdamW optimizer. The resulting model achieves good performance on the GoEmotions test data, with an accuracy of 83% and macro F1 score of 72%. We refer to this model variant as T5-Reddit in the rest of the discussion. To use these models on the CaSiNo dataset, we preprocess the utterances by removing all the emoticons and then use them as inputs for both T5 variants. The output prediction of these models is based on the confidence scores that they generate for all the corresponding emotion labels, such that the label with the highest confidence is chosen as the prediction. In our case, since we are primarily interested in extracting the emotion scores for the utterances, we do not rely on the final predictions of the model. Instead, we directly use these confidence scores (probability values between 0 and 1) as the corresponding scores for each label. Given a negotiation dialogue and a 53 participant, we simply sum up these confidence scores for all the participant’s utterances and use these as the continuous measures for the corresponding six emotion dimensions. All the approaches discussed above for measuring emotion expression in dialogues work in fundamentally very different ways. To gain further insight into these methods, in Section 4.2, we will analyze the relationship between them through a combination of quantitative and qualitative techniques. First, we complete our description of measures by elaborating on the subjective negotiation outcomes. 
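To make the scoring procedure concrete, the following is a minimal sketch of one way to obtain per-label confidence scores from the publicly available T5-Twitter checkpoint named above and to aggregate them per participant. The task prefix and the label-likelihood scoring mechanism are assumptions for illustration; the thesis simply uses the models' own confidence scores.

```python
# Hypothetical sketch: per-label confidence from a fine-tuned T5 emotion model,
# summed over a participant's utterances (Section 4.1.2). Each label string is
# scored by its sequence likelihood under the model and normalized with softmax.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

LABELS = ["joy", "love", "sadness", "fear", "anger", "surprise"]  # T5-Twitter label set
tok = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-emotion")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-emotion").eval()

@torch.no_grad()
def emotion_scores(utterance: str) -> dict:
    enc = tok("emotion: " + utterance, return_tensors="pt")  # "emotion: " prefix is assumed
    label_nll = []
    for label in LABELS:
        target = tok(label, return_tensors="pt").input_ids
        out = model(**enc, labels=target)             # cross-entropy over the label tokens
        label_nll.append(out.loss * target.size(1))   # total negative log-likelihood
    probs = torch.softmax(-torch.stack(label_nll), dim=0)
    return dict(zip(LABELS, probs.tolist()))

def participant_affect(utterances: list) -> dict:
    # Sum the per-utterance confidence scores for one participant in one dialogue.
    totals = {label: 0.0 for label in LABELS}
    for u in utterances:
        for label, p in emotion_scores(u).items():
            totals[label] += p
    return totals
```

The same aggregation applies to the T5-Reddit variant after swapping in its checkpoint and Ekman label set.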
4.1.3 Outcomes

We focus on two outcome variables: 1) Satisfaction (How satisfied are you with the negotiation outcome?), and 2) Liking (How much do you like your partner?). Both variables were self-reported by the participants after the negotiation, using a 5-point Likert scale. We coded the responses from 1 to 5 to create the continuous measures. We present the means and standard deviations for all the continuous variables in Table 4.1.

Variable                    Mean    Std.   Corr. w/ Satisfaction   Corr. w/ Liking
Individual Differences
  Age                       36.24   10.32        .110**                .158**
  Education                  5.37    1.62       −.004                 −.011
  Extraversion               3.74    1.70        .055*                 .061**
  Agreeableness              5.34    1.28        .068**                .098**
  Conscientiousness          5.68    1.21        .026                  .045*
  Emotional Stability        5.04    1.55        .062**                .086**
  Openness to Experiences    5.05    1.28        .027                  .041
Emoticons
  Joy                         .77    1.24        .040                  .075**
  Sadness                     .13     .42       −.151**               −.170**
  Anger                       .02     .18       −.074**               −.109**
  Surprise                    .07     .28       −.062**               −.059**
LIWC
  Positive Emotions          5.59    3.38        .034                  .062**
  Sadness                     .22     .52       −.056*                −.090**
  Anger                       .05     .25       −.047*                −.043
  Anxiety                     .18     .52       −.040                 −.016
T5-Twitter
  Joy                        3.89    1.09        .010                  .026
  Love                        .22     .21       −.010                 −.028
  Sadness                     .35     .42       −.141**               −.172**
  Fear                        .44     .43       −.121**               −.117**
  Anger                       .82     .66       −.217**               −.296**
  Surprise                    .03     .08       −.011                 −.016
T5-Reddit
  Joy                        3.70    1.19        .011                  .030
  Disgust                     .01     .03       −.123**               −.169**
  Sadness                     .20     .36       −.119**               −.179**
  Fear                        .07     .24       −.041                 −.036
  Anger                       .26     .42       −.239**               −.340**
  Surprise                   1.54    1.04       −.098**               −.108**
Negotiation Outcomes
  Satisfaction               4.17    1.03       1                      .702**
  Liking                     4.11    1.12        .702**               1

Table 4.1: Statistics and correlations with the outcome variables for all continuous measures. * denotes p < 0.05. ** denotes p < 0.01.

4.2 Analysis of Emotion Variables

Before presenting our results for predicting satisfaction and liking, we first validate the T5-Twitter and T5-Reddit measures by understanding how they relate to the Emoticon and LIWC measures.

4.2.1 Correlation Among Emotion Dimensions

For the correlational analysis, we only focus on the dimensions that appear in all four models: Joy, Sadness, and Anger. For LIWC, we consider the Positive dimension. We provide the full tables in Appendix B.1 for reference and briefly discuss the overall observations here. In general, we find significant positive correlations between the emotion dimensions and their own counterparts across the four models, which aligns well with our expectations and validates the model predictions on the CaSiNo dataset. We observe that although significant, most correlation values are weak to moderate, indicating the fundamental differences in what is captured by these models. This holds true even for correlations between T5-Twitter and T5-Reddit dimensions, indicating that the two fine-tuned models learn to capture emotions differently, primarily due to the domain differences in the underlying datasets. Finally, we find that Anger shows significant positive correlations with Sadness as well. Since these measures are based on the confidence scores of the deep learning models, this suggests that the models might get confused between these two negative emotion dimensions.

4.2.2 Lexical Correlates of Emotion

To gain more insights, we perform an additional qualitative analysis by generating lexical correlates for the emotion measures. We assigned an emotion label to each utterance for all methods. For Emoticons and LIWC, this was based on the label that is reflected in the majority. For the T5-Twitter and T5-Reddit dimensions, we simply used the final predictions from the model.
We next compute the log-odds ratio with an informative Dirichlet prior (Monroe et al., 2008) for all the tokens of each emotion dimension relative to all other dimensions. This allows us to study the associations between the tokens and each emotion dimension. We provide the top five words for each category in Appendix B.2 for reference. In general, we observe a number of highly significant associations across all categories that align with our intuition, such as words like ‘awesome’, ‘great’, and ‘thank’ for the Joy and Positive Emotions categories. We found these words to be common at the beginning and end of the negotiation, where participants engage in small talk, resulting in such utterances being classified as Joy or Positive. Further, the associations are higher for the LIWC and T5-based methods, which, as we observe, are able to recognize many other emotive utterances. Although significant, we observe that some words are less meaningful, such as in the case of Anger. This suggests that some emotion dimensions may require more utterance-level context for better interpretation.

4.2.3 Sample Predictions

For a more contextual analysis, we explicitly examine sample predictions in this Section. We still focus on the three dimensions that are the most common and are present in all methods. Table 4.2 presents a few of the most confident predictions. To analyze the utterances that are specifically captured by the deep learning models, we only look at common high-confidence predictions of the T5-based models that are undetected by the Emoticon and LIWC methods. As the table depicts, the T5 models are able to capture contextual emotion beyond just lexical or emoticon usage. Based on these observations, we expected the affective attributes extracted using the deep learning models to be better predictors than the Emoticon and LIWC measures. Further, since T5-Reddit is trained on a much larger emotion dataset, one can naturally expect this model to be superior to all others in capturing intricate emotion in the utterances.

Prediction   Sample Utterances
Joy          1) I think that sounds reasonable to me.
             2) I can make that deal work.
             3) Yes I think that would be a workable solution. I like your compromise.
Sadness      1) I feel like this deal just keeps getting worse for me. I won't take less than 2 waters.
             2) That is unfortunate. How about I give you all my firewood for all your food?
             3) I am disabled and trying to find water in the forest will be very difficult.
Anger        1) You are not getting everything, thats just selfish
             2) I am about to walk, you are being so unfair
             3) You gave me one on each, thats unfair man!

Table 4.2: High-confidence sample predictions from the deep learning models that went undetected by both the Emoticon and LIWC methods.

4.3 Results

We now present our main results.

4.3.1 Correlation with Outcomes

Table 4.1 summarizes the correlations of all the continuous measures defined in Section 4.1 with the negotiation outcomes. Overall, we observe a number of significant trends across the individual differences and emotion variables. We find that satisfaction and partner liking both improve with age (r = .110, p < .01). The outcomes are also positively correlated with Agreeableness and Emotional Stability (p < .01). We observe that positive emotions tend to relate positively with both outcomes, showing significant trends for liking in the case of Emoticon usage (r = .075, p < .01) and LIWC (r = .061, p < .01). We do not observe similar results for the T5-based models.
This may be due to the inherent bias that the deep learning model shows towards the Joy label, reducing its precision. However, the trends show significant negative correlations for both Sadness and Anger, consistently for both deep learning methods. These emotions are usually expressed when the negotiation is not going favorably. This is naturally somewhat associated with lower satisfaction of the participants and lower liking for their negotiation partners. Among the discrete variables, we find that females are significantly more satisfied than males (t = 3.6, p < .001) and also like their negotiation partners significantly more (t = 4.1, p < .001). We do not observe any significant associations with SVO or Ethnicity, based on a t-test and a one-way ANOVA, respectively.

4.3.2 Regression Analysis

We now discuss the results of our regression analysis, which answers whether the emotion variables extracted from the negotiation dialogue collectively explain more variance in satisfaction and liking, above and beyond the individual difference variables. To achieve this, we perform a regression with three steps, where each subsequent step incorporates the following set of variables: 1) Individual differences, 2) Affect variables of the participant, and 3) Affect variables of the negotiation partner. The affect variables come from one of the emotion methods defined in Section 4.1, which also helps to compare them. As discussed earlier, these variables only encompass the information about the negotiation that would be entirely visible to an automatic negotiation agent, either explicitly from the negotiation dialogue or implicitly by inferring the individual difference attributes from the past online behaviors of the users on social media platforms. Hence, the insights from such an analysis can practically aid in designing sophisticated negotiation agents that incorporate user satisfaction and the user's perception of the agent itself in their modeling.

Tables 4.3 and 4.4 summarize the results for satisfaction and liking, respectively. Overall, we observe that the predictions are highly significant by just using the individual difference variables. However, incorporating the affective attributes of the participant and the negotiation partner explains significantly more variance. For satisfaction, the individual difference variables account for minimal yet significant variance (F(14, 1997) = 3.46, p < .001, R² = .024). Among all the emotion methods, we find that the T5-Reddit dimensions explain the most variance. Adding the participant's affect variables based on the T5-Reddit model in the second step helps to account for a much higher variance (F(20, 1991) = 11.62, p < .001, R² = .105), such that this increase in the proportion is itself highly significant (ΔF(6, 1991) = 29.98, p < .001, ΔR² = .081). Yet further variance is explained when the partner's affect variables are incorporated (F(26, 1985) = 12.11, p < .001, R² = .137; ΔF(6, 1985) = 12.41, p < .001, ΔR² = .032). Similar trends can be observed for liking as well. The individual difference variables alone account for a significant proportion (F(14, 1997) = 6.05, p < .001, R² = .041), but adding the participant's T5-Reddit attributes further improves the prediction (F(20, 1991) = 22.94, p < .001, R² = .187) with a significant increase (ΔF(6, 1991) = 59.85, p < .001, ΔR² = .147). Finally, when the partner's attributes are also incorporated as predictors, this again shows significant improvements in the explained variance (F(26, 1985) = 24.49, p < .001, R² = .243; ΔF(6, 1985) = 24.28, p < .001, ΔR² = .056).
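The stepwise procedure above can be reproduced with standard tools. The following is a minimal sketch (not the original analysis code) using statsmodels and scipy, assuming a dataframe with one row per participant and a column list for each predictor block; the F-change test follows the usual nested-model comparison.

```python
# Hypothetical sketch of the hierarchical (stepwise) regression: fit nested OLS
# models, then report R^2, delta-R^2, and the F-change for each added block.
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def hierarchical_regression(df: pd.DataFrame, outcome: str, blocks: list):
    results, used = [], []
    prev_r2, prev_k = 0.0, 0
    for block in blocks:
        used = used + block
        X = sm.add_constant(df[used])
        fit = sm.OLS(df[outcome], X).fit()
        n, k = len(df), len(used)
        added = k - prev_k
        # F-change for the newly added predictors (nested model comparison).
        f_change = ((fit.rsquared - prev_r2) / added) / ((1 - fit.rsquared) / (n - k - 1))
        p_change = 1 - stats.f.cdf(f_change, added, n - k - 1)
        results.append({"R2": fit.rsquared, "dR2": fit.rsquared - prev_r2,
                        "F_change": f_change, "p_change": p_change})
        prev_r2, prev_k = fit.rsquared, k
    return results

# Example usage with assumed column names:
# blocks = [individual_diff_cols, participant_affect_cols, partner_affect_cols]
# hierarchical_regression(data, "satisfaction", blocks)
```

The same routine, with the blocks reordered or extended, covers the analyses in Sections 4.3.3 to 4.3.5 as well.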
Lastly, we note that, in line with our linguistic analysis, the ability of T5-Twitter and T5-Reddit to capture more contextual emotive utterances helps to account for more variance compared to Emoticons and LIWC independently. In our analyses, we also observed that even when Emoticons and LIWC are combined, just using T5-Reddit still achieves much better results. We further explore this comparison between emotion recognition techniques in Section 4.3.4, where we analyze these methods against each other.

Variables                  R²      df           F          R² Change   F Change
Emoticons
  Individual Differences   .024    (14, 1997)   3.46***    −           −
  +Participant Affect      .051    (18, 1993)   5.92***    .027        14.20***
  +Partner Affect          .063    (22, 1989)   6.04***    .012        6.34***
LIWC
  Individual Differences   .024    (14, 1997)   3.46***    −           −
  +Participant Affect      .032    (18, 1993)   3.61***    .008        4.07**
  +Partner Affect          .051    (22, 1989)   4.85***    .019        10.14***
T5-Twitter
  Individual Differences   .024    (14, 1997)   3.46***    −           −
  +Participant Affect      .095    (20, 1991)   10.41***   .071        26.02***
  +Partner Affect          .125    (26, 1985)   10.88***   .030        11.38***
T5-Reddit
  Individual Differences   .024    (14, 1997)   3.46***    −           −
  +Participant Affect      .105    (20, 1991)   11.62***   .081        29.98***
  +Partner Affect          .137    (26, 1985)   12.11***   .032        12.41***

Table 4.3: Regression results for predicting Satisfaction. ** denotes p < 0.01. *** denotes p < 0.001.

Variables                  R²      df           F          R² Change   F Change
Emoticons
  Individual Differences   .041    (14, 1997)   6.05***    −           −
  +Participant Affect      .080    (18, 1993)   9.68***    .040        21.50***
  +Partner Affect          .097    (22, 1989)   9.70***    .017        9.12***
LIWC
  Individual Differences   .041    (14, 1997)   6.05***    −           −
  +Participant Affect      .055    (18, 1993)   6.39***    .014        7.32***
  +Partner Affect          .072    (22, 1989)   7.06***    .018        9.56***
T5-Twitter
  Individual Differences   .041    (14, 1997)   6.05***    −           −
  +Participant Affect      .154    (20, 1991)   18.16***   .114        44.58***
  +Partner Affect          .200    (26, 1985)   19.07***   .046        18.83***
T5-Reddit
  Individual Differences   .041    (14, 1997)   6.05***    −           −
  +Participant Affect      .187    (20, 1991)   22.94***   .147        59.85***
  +Partner Affect          .243    (26, 1985)   24.49***   .056        24.28***

Table 4.4: Regression results for the prediction of Liking. *** denotes p < 0.001.

4.3.3 Controlling for the Objective Performance

Apart from the subjective performance metrics of satisfaction and liking, the CaSiNo dataset also provides information on the objective performance of the participants in terms of the points they scored at the end of the negotiation. Each item of High, Medium, and Low priority is worth 5, 4, and 3 points, respectively. The total points scored by the participants can be computed based on how many items of each kind they are able to negotiate for. One might argue that the subjective outcomes of satisfaction and liking could be explained by objective performance in the negotiation and, moreover, that the predictive power of the emotion variables seen above (in Section 4.3.2) is merely a result of the relationship between the emotion variables and the objective negotiation performance. That is, all of the emotions that the participants express might just reflect how much they are getting in the negotiation, and if that were the case, then the emotion variables would no longer predict the subjective outcomes when controlling for the points that the participant scored in the negotiation.
Alternatively, we propose that emotion expression provides more signals (in terms of satisfaction and liking) than just how much the participants are getting in the agreed deal, and thus, the emotion variables should continue to predict significant variance above and beyond objective performance. We assess this hypothesis through a modified regression, where we consider the points scored by the participant in the first step, followed by the same three steps used in Section 4.3.2. The full result tables for this analysis have been provided in the Appendix B.3 for reference. We briefly discuss the significant trends here. Overall, we observe that although points scored explain significant variance in both satisfaction and liking metrics by themselves, we still observe similar significant trends for incorporating emotion variables, as before. 62 For satisfaction, the points scored themselves do explain relatively high variance (F(1, 2010)=343.88, < .001, R2=.146), but the results still show high significant improvements with the incorporation of individual differences and subsequent affect variables, achieving the best results with T5-Reddit (F(27, 1984)=23.78, < .001, R2=.245). In comparison, this is nearly twice the maximum variance explained by not incorporating the points scored into the prediction model, as reported in Section 4.3.2 (F(26, 1985)=12.11, < .001, R2=.137). For liking, the points scored account for a significant but much lower variance itself (F(1, 2010)=167.38, < .001, R2=.077). One explanation for this effect can be that the reported satisfaction of the participants is geared towards their own outcomes, while their liking is geared towards their relationship with their negotiation partners. While this shows that their overall perception towards their partners can improve with their own negotiation performance, the results can be hugely improved with the incorporation of affect variables, with T5-Reddit still proving to be superior (F(27, 1984)=29.14, < .001, R2=.284). As hypothesized, these observations show that the emotion variables continue to predict significant variance above and beyond objective performance and, therefore, provide more signals (in terms of satisfaction and liking) than just how many points the participants are getting in the negotiation. This points to the value of extracting additional variables from the negotiation dialogue to better capture the subjective outcomes in a negotiation. In Section 4.5, we will briefly discuss a number of other variables suggested by the prior work, which can further improve this performance. 63 4.3.4 Comparing Emotion Recognition Methods Our results indicate that emotion variables based on the deep learning model T5 better explain the variance in satisfaction and liking compared to those based on other emotion recognition methods. However, till now, we have only compared the different types of emotion variables by looking at their predictive power in separate regression models. Accordingly, it is reasonable to expect that in addition to T5-based variables explaining more variance than other types of emotion variables (based on emoticons and LIWC), T5-based variables might also explain the same variance in outcomes (and more) that Emoticon and LIWC variables do. That is, the deep learning models might provide unique cues about the subjective outcomes but they also explain the variance that is explained by other methods. 
Alternatively, it is possible that Emoticon and LIWC variables might tap into different aspects of satisfaction and liking than deep learning models. In this case, incorporating Emoticon and LIWC variables should continue to predict significant variance above and beyond deep learning based variables. To test this possibility, we now enter these types of emotion variables as predictors in the same regression. To this end, we design another regression with three steps: 1) Individual differences, 2) All T5-Reddit variables, and 3) All Emoticon and LIWC variables. For simplicity, here we combine the affect variables from the participant and the partner together in the same step. Among the deep learning models, we only focus on T5-Reddit in this case due to its superior performance. The results are summarized in Table 4.5. First, we note that T5-Reddit variables themselves show huge improvements above only the individual difference variables, as already established by the analyses presented in the previous sections. Further, with the incorporation of Emoticon and LIWC, we find that the prediction model shows minimal but still significant improvements above 64 just using T5-Reddit, for both satisfaction (F(42, 1969)=8.30, < .001, R2=.150: ΔF(16, 1969)=1.96, < .05, ΔR 2=.014) and liking (F(42, 1969)=16.05, < .001, R2=.255: ΔF(16, 1969)=2.01, < .05, ΔR 2=.012). These results show that Emoticon and LIWC do provide some unique value to the model, which is not already captured by T5-Reddit variables, indicating that automated systems can benefit from leveraging these different methods of emotion recognition collectively, by extracting emotion along the emoticons, lexical, and contextual dimensions. However, we point out that these effect sizes are larger when we experiment with T5-Twitter method instead of T5-Reddit. This provides evidence that more sophisticated deep learning architectures and trained models may further close this gap. Hence, we encourage researchers to explore this further in the future. Variables R 2 df F R 2 Change F Change Satisfaction Individual Differences .024 (14, 1997) 3.46*** − − +T5-Reddit .137 (26, 1985) 12.11*** .113 21.71*** +Emoticons+LIWC .150 (42, 1969) 8.30*** .014 1.96* Liking Individual Differences .041 (14, 1997) 6.05*** − − +T5-Reddit .243 (26, 1985) 24.49*** .202 44.17*** +Emoticons+LIWC .255 (42, 1969) 16.05*** .012 2.01* Table 4.5: Regression results for predicting Satisfaction and Liking with emotion recognition techniques pitted against each other. We only consider the best-performing deep learning model (T5-Reddit) for this analysis. Further, we combine the affect variables corresponding to the participant and the partner in the same step. * denotes < 0.05. ** denotes < 0.01. *** denotes < 0.001. 4.3.5 Studying Temporal Effects Finally, we conduct an analysis to study the temporal effects of emotion expression on the prediction of outcomes. Our primary objective here is to guide the design of an automated negotiation 65 Variables R 2 df F R 2 Change F Change Satisfaction Individual Differences .024 (14, 1997) 3.46*** − − +First Half Affect .091 (26, 1985) 7.69*** .068 12.35*** +Second Half Affect .144 (38, 1973) 8.76*** .053 10.16*** Liking Individual Differences .041 (14, 1997) 6.05*** − − +First Half Affect .161 (26, 1985) 14.66*** .120 23.74*** +Second Half Affect .253 (38, 1973) 17.55*** .092 20.13*** Table 4.6: Temporal analysis for predicting Satisfaction and Liking. 
We only consider T5-Reddit dimensions due to their superiority in other experiments. We combine the affect variables corresponding to the participant and the partner in the same step. *** denotes < 0.001. agent. We aim to understand whether emotion expression at the start of the conversation is sufficient to build the prediction model or whether a more continual learning approach is preferred. Since the number of turns in each dialogue is relatively small in CaSiNo, it is less meaningful to study the variation in emotion expression at a granular level. Hence, we divide the dialogues into two halves and compute the emotion variables on each half separately. Further, we focus only on T5-Reddit variables since they do not rely on the usage of emoticons and specific lexical usage that can be limited when considering a portion of the dialogue. To answer our research question, we consider the individual differences, the first half affect, and the second half affect variables as the three steps in our regression model. We present the results in Table 4.6. As evident, for both satisfaction and liking, it is necessary to consider the affect attributes extracted from both halves of the dialogues to match the overall performance as reported in previous sections. For satisfaction, while just incorporating the first half affect shows significant improvements over individual differences (F(26, 1985)=7.69, < .001, R 2=.091), incorporating variables from the latter parts of the conversation shows further substantial gains (F(38, 1973)=8.76, < .001, R2=.144). We observed similar trends for liking prediction as 66 well. Simply incorporating the affect expression in the first half shows significant benefits but fails to match the variance explained when the entire dialogue was used (F(26, 1985)=14.66, < .001, R 2=.161), which is only matched after incorporating the affect expression from the second half as well (F(38, 1973)=17.55, < .001, R2=.253). These findings point towards the importance of measuring emotion variables continuously for an automated agent. The agent’s design should incorporate the entire information as it is received and update its prediction of the subjective outcomes in a negotiation. 4.4 Do These Findings Generalize? Variables R 2 df F R 2 Change F Change Satisfaction Individual Differences .044 (11, 302) 1.26 − − +Participant Affect .096 (17, 296) 1.84* .052 2.84* +Partner Affect .133 (23, 290) 1.93** .037 2.06 Liking Individual Differences .033 (11, 302) .94 − − +Participant Affect .054 (17, 296) .99 .021 1.07 +Partner Affect .104 (23, 290) 1.47 .051 2.73* Table 4.7: Regression results on F2F dataset using T5-Reddit emotion model. * denotes < 0.05. ** denotes < 0.01. Our analysis of the CaSiNo dataset provides evidence of the utility of capturing emotion in negotiation dialogues. In order to see whether these findings generalize, we extend our analysis to a different multi-issue bargaining dataset. Specifically, we analyze the Face-to-Face negotiation dataset (F2F) released in prior work (Lucas et al., 2016). Each negotiation in this dataset involves two participants who role-play as owners of antique stores and negotiate to divide six antique items: three crates of LP records, two lamps, and one painting. In total, 183 interactions were 67 curated among same-sex dyads (73 female and 110 male). 
The average age of the participants is 46.80 (SD = 14.53), among which 36.6% racially identified as Caucasian, 35.8% as African-American, 10.4% as Hispanic, 8.5% as Asian, 6.3% as other, 2.4% as Native American or Hawaiian. On average, the interactions last for 131.8 (SD = 88.9) utterances, with individual utterance having 7.2 (SD = 7.5) tokens. We refer the readers to the original dataset paper for more details. We chose this F2F dataset due to the availability of rich participant meta-data and the recorded subjective performance metrics, which are essential for our analysis in this work. For participant demographics, we use Age, Gender, Ethnicity, and Education, with Gender and Ethnicity dummy-coded, as before. For personality, we use Social Value Orientation (as also in CaSiNo) and Machiavellianism (Mach). Mach is another personality trait commonly discussed in the context of competitive negotiations. It points to manipulative and goal-seeking behavior. Participants with a high Mach score tend to be overly rational or cold and are less influenced by emotional arguments, making it an interesting metric to capture for our analysis in predicting subjective negotiation outcomes (Fehr et al., 2013). To measure the emotion dimensions, we use the best-performing method from before (T5-Reddit) and apply it to the transcripts in the F2F data. Due to the high variance in the length of conversations, we further normalize these scores by dividing them by the total number of speaker utterances in a given negotiation. Finally, we describe how the final outcomes are measured. For satisfaction, we use the participant responses to the question - ‘How satisfied were you with the final agreement?’ on a 7-point scale from Very Dissatisfied to Very Satisfied. For liking, we use their responses to ‘What was your impression of the other party?’ on a 7-point scale from Negative to Positive. We treat all these variables as continuous measures. After removing any missing data, we end up with 314 data points for our final analysis. 68 We present the results for predicting satisfaction and liking in Table 4.7. Unlike before (Tables 4.3 and 4.4), we observe that individual difference variables do not significantly predict these subjective outcomes of the negotiation. Incorporating affect variables improves the explained variance in both satisfaction and liking while achieving significant results for the former metric. For satisfaction, results with only individual variables are not significant (F(11, 302)=1.26, > .05, R2=.044). However, incorporating participant affect variables significantly explains more variance (F(17, 296)=1.84, < .05, R2=.096: ΔF(6, 296)=2.84, < .05, ΔR 2=.052), and leads to further improvements after using partner affect variables (F(23, 290)=1.93, < .01, R2=.133: ΔF(6, 290)=2.06, > .05, ΔR 2=.037). For liking, combining individual difference and participant affect variables does not show significant prediction results (F(17, 296)=.99, > .05, R2=.054). Once partner affect variables are incorporated, the results are not significant (F(23, 290)=1.47, > .05, R 2=.104), but the improvements are (ΔF(6, 290)=2.73, < .05, ΔR 2=.051). The F2F data is challenging for multiple reasons that might explain less significant results in comparison to CaSiNo. First, the data is much smaller with missing values, leaving us with merely 15% of the data as compared to CaSiNo. 
Second, the data contains transcripts of spoken dialogue, which is inherently different from the written language that was used to train our T5 models, leading to a domain mismatch. And third, the F2F data involves multimodal interactions, with emotion expressed not just through text transcripts but also via speech signals, facial expressions, and body actions like ‘laughter’ and ‘sigh’, which have not been considered in our analysis. Hence, we still find the improvements observed in Table 4.7 as a promising signal that our findings are generalizable to other scenarios. We encourage researchers to explore this further in the future. One possible direction to improve the performance can be to match the domain of the training data and the final conversations, for instance, using emotion models trained on spoken 69 language. Another direction can be to build emotion models that are able to gather cues from multimodal inputs including facial expressions from images and videos, making them more suited for analyzing face-to-face interactions comprehensively. 4.5 Conclusion Our aim was to empirically investigate the extent to which the affect variables extracted from the negotiation itself are helpful for the prediction of two important metrics for negotiation agents – participant’s outcome satisfaction and liking for the partner. We presented an extensive analysis based on a large-scale dataset of human-human negotiation dialogues grounded in a realistic camping scenario. We devised three degrees of emotion dimensions, from leveraging emoticon usage similar to prior work in menu-driven systems to going beyond and extracting emotion expression directly from the textual chat utterances. Our results show that such affect variables explain variance in these subjective outcomes, above and beyond the impact of individual difference variables that are available before the negotiation begins. We find that the contextual emotion dimensions from the deep learning models, T5-Twitter and T5-Reddit, better account for the variance than emoticon or lexical level measures (Sections 4.2 and 4.3.4). One might expect that the affect variables would help merely because they are manifestations of individual differences such as social value orientation, agreeableness, or gender. If this was indeed the case, there would have been no need to incorporate such affective variables, but merely the individual difference variables would have sufficed. However, to the extent to which the affect variables help in predictions, above and beyond these individual differences, this suggests that 70 there is utility in extracting these emotion dimensions explicitly for improved agent performance, being especially well-suited for more subjective outcomes in a negotiation. We further note that the individual difference variables that we use are based on self-identification and standard personality tests from the psychology literature. This makes these variables more reliable than if they were to be implicitly inferred from past social behaviors of the users, as attempted in a number of previous works analyzing interactions on social media platforms (Ortigosa et al., 2011; Dong et al., 2014; Adalı and Golbeck, 2014). If affective factors show significant improvements above the individual differences measured with minimal error, this further attests to their utility when the demographics and personality information is inferred, and thus, less reliable. 
Since negotiations and other mixed-motive situations can be fraught with emotional decisions, our findings suggest that it would be useful for designers of agents that negotiate with humans to be armed with algorithms for understanding the unfolding emotions displayed by users in terms of their emoticon use and natural language. Our work suggests that demographic and personality attributes inferred from past social interactions (such as on social media) are not sufficient and that providing information to negotiating agents from affective channels tracked during the negotiation itself is important for developing agents that can predict or understand if their human counterpart is satisfied and likes them. Ultimately, outcomes like satisfaction and liking will be essential for such agents to cultivate if they, like human negotiators, hope to successfully negotiate with that same partner in future interactions (Aydoğan et al., 2020). We note that even after incorporating the emotion attributes along with the individual difference variables, the reported correlation values, although highly significant, are moderate. Explaining variance in such outcomes is challenging because real-world negotiations are complex, 71 and a myriad of factors can contribute. However, accordingly, models could explain more of the variance in negotiation outcomes by adding additional predictor variables. For example, as the results from Section 4.3.3 suggest, incorporating objective performance into the model explains more variance. Other possible predictors that could explain additional variance include prior expectations of the negotiators (Oliver et al., 1994), selfish linguistic strategies such as undervaluing their partners (Chawla et al., 2021b), rapport building (Nadler, 2004), favor exchange (Mell et al., 2015), and expression of empathy (Klimecki, 2019). Prajod and colleagues manipulated the warmth expressed by the agent through multiple modalities such as facial expressions and eye gaze. They found that higher warmth is related to increased satisfaction, even with a similar objective performance (Prajod et al., 2019). Combining these variables along with the affective information considered in this work provides exciting avenues for future research in automated agents, especially in the cases with face-to-face interactions as discussed in Section 4.4. In the next Chapter, we show how in addition to the analysis presented in this work, the available metadata in CaSiNo also allows us to explore data augmentation techniques that have the potential to reduce manual annotation efforts. We devise methods for data adaptations to train a deep learning model on the important task of Partner Modeling introduced in Chapter 2. 72 Chapter 5 Data Augmentation Improves Partner Modeling Figure 5.1: A simplified view of a multi-issue negotiation based on the scenario in CaSiNo (Chawla et al., 2021b). The negotiation involves 3 issues: Food, Water, and Firewood, each with 3 items that must be divided among the two players. From the perspective of player P1, the task of partner modeling considered in this work involves inferring the priority order of partner P2 from the interaction between the two. Figure 5.1 provides the structure of an instance in the CaSiNo dataset (Chapter 3). As evident, the priority order of the partner is typically unknown to negotiators beforehand and can only be inferred based on the interaction between the two. 
Prior work argues that understanding what one's partner wants is one of the key aspects of successful negotiations (Baarslag et al., 2013). An accurate model of the partner can enable a dialogue system to roll out offers that work for both parties, which has implications for both its objective performance, such as the final points scored from the agreed deal, and its subjective performance, such as the partner's satisfaction and affinity for the dialogue system. This can also aid in pedagogy by allowing the system to provide concrete feedback to students who fail to incorporate the priorities of their partners (Johnson et al., 2019b). Discovering these priorities from an interaction with a partner is usually referred to as Partner Modeling in the context of multi-issue negotiations.1 Information about a partner's priorities can primarily be gathered from their preference and offer statements (Nazari et al., 2015). Sharing preferences by explicitly mentioning ‘We need water’ or more implicitly - ‘We like to go on runs’ - can provide information that water is of high priority to the negotiator. Further, offers such as ‘I would like two food items and one water’ can imply that food is of a higher priority than water.

Building techniques for partner modeling that are useful in realistic chat-based negotiations poses several key challenges: 1) It is non-trivial to directly use counting-based methods on these preference and offer statements, which are common in prior work that does not use natural language, such as agent-agent negotiations (Williams et al., 2012) and human-agent negotiations based on button clicks (Section 2.3). 2) To alleviate this problem for language-based interactions, prior work has resorted to gathering additional utterance-level annotations to convert the desirable information into a more structured format that can then be used with counting methods (Nazari et al., 2015). However, this approach remains expensive, requires expertise, and hurts generalizability. Further, these annotations are unavailable for systems that are deployed to end users, requiring a separate NLU module that can potentially lead to error propagation in the downstream dialogue system pipeline. 3) Some real-world applications require the system to guess the partner's priorities with only a partial dialogue, so as to inform the future decision process of the system - a scenario that has not been well explored in prior work.

1 Partner Modeling is, in fact, a broader term encompassing any attempt to infer the mental states of the interaction partner. In the context of mixed-motive interactions, it can include priority estimation (which is the focus of this work), predicting partner limits like BATNA (Sebenius, 2017), and classifying partners into categories such as those based on personality traits (Baarslag et al., 2016; Albrecht and Stone, 2018).

To address these challenges, we propose a transformer-based (Vaswani et al., 2017) hierarchical ranker for partner modeling in negotiation dialogues (Section 5.1). Our model takes a partial dialogue as input and guesses the partner's priority order. Instead of relying on utterance-level discourse information, we devise simple and effective ways to project related data sources to this task (Section 5.2).
As opposed to multi-task learning, which typically involves task-agnostic and task-specific parameters along with back-to-back fine-tuning procedures that suffer from catastrophic forgetting, our adaptations augment the training data available to the model, allowing end-to-end joint learning and parameter sharing. For a comprehensive evaluation that serves multiple downstream applications, we propose three evaluation metrics for this task (Section 5.3). Our experiments are based on two dialogue datasets in English, showing the utility of the proposed methodology with complete or partial dialogue as input in full, few-shot, and zero-shot scenarios (Section 5.4).

5.1 Problem Formulation

Consider a negotiation between two parties over m issues. We define the problem from the perspective of a specific negotiator (referred to as self, hereafter) and aim to predict the priority order of the partner (see Figure 5.1). Assume that the dialogue D contains an alternating sequence of utterances between the negotiator self and the partner. The partial interaction is D_k, which is obtained after self observes k utterances from the partner.2 The goal is to build the model F, with Y = F(D_k), where Y is the desired priority order of the partner. In our experiments, we consider metrics that measure the performance for the complete dialogue and for different values of k (Section 5.4).

2 D_k will contain either 2k or 2k − 1 utterances, depending on who starts the conversation.

5.2 Methodology

Figure 5.2: Our proposed methodology for partner modeling in negotiation dialogues. The approach involves three main components: Section 5.2.1 describes our core hierarchical encoder that takes in a partial dialogue and outputs the partner priority order after seeing each utterance, Section 5.2.2.1 covers the adaptation of an argument-centric dataset (CA data) targeted towards better modeling the preference statements of the partner, and Section 5.2.2.2 describes the adaptation of an offer-centric dataset (DND data) targeted towards the offer statements of the partner.

We present our approach in Figure 5.2, which contains three main components: a hierarchical core model that takes in a partial dialogue and outputs the desired ranking order, and two modules for data adaptation that are designed to better model the preference and offer statements of the partner. We first describe our core model, assuming a general input, and then describe the proposed data augmentation techniques.

5.2.1 Hierarchical Encoder

Our encoder (the orange segment in Figure 5.2) uses two levels to build contextual utterance representations, which are then used to output a score for each of the issues, representing the ranking order among them.

Utterance Encoder: First, a sentence-level module (Level I) encodes each utterance in the input U = [U_1, U_2, . . . , U_n] separately. We prepend the utterances with a special token to indicate the author: <self> or <partner>. To encode a contextually rich representation, our Level I encoder uses pretrained language models (Devlin et al., 2019; Liu et al., 2019), given their success across a wide range of NLP tasks, especially in low-resource settings on similar NLU tasks (Balaraman et al., 2021). For each utterance U_i with t tokens, the pretrained model first embeds the input words into the embedding matrix E_i ∈ R^(t×d). After passing through the encoding layers, the pretrained model outputs d-dimensional word representations W_i ∈ R^(t×d). Finally, this is followed by pooling to obtain the utterance representation u_i ∈ R^d.
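As an illustration, the following is a minimal sketch of this Level I encoding using the HuggingFace transformers library; the specific checkpoint, mean pooling, and function structure are assumptions for illustration rather than the exact implementation.

```python
# Minimal sketch of the Level I utterance encoder (assumed details): encode each
# speaker-tagged utterance with a pretrained model, mean-pool the token states,
# and stack the pooled vectors into the conversation matrix H.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["<self>", "<partner>"]})
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.resize_token_embeddings(len(tokenizer))

@torch.no_grad()
def encode_dialogue(utterances, speakers):
    """utterances: list of n strings; speakers: list of 'self'/'partner' tags."""
    tagged = [f"<{s}> {u}" for s, u in zip(speakers, utterances)]
    batch = tokenizer(tagged, padding=True, truncation=True, return_tensors="pt")
    W = encoder(**batch).last_hidden_state            # (n, t, d) token representations
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding when pooling
    H = (W * mask).sum(1) / mask.sum(1)               # (n, d) conversation matrix
    return H
```

In the full model, the resulting matrix H feeds into the Level II dialogue encoder and output layers described next.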
Dialogue Encoder: At Level II, we use a transformer block with masked self-attention (Vaswani et al., 2017). Self-attention enables efficient interactions for encoding partial conversations. A target utterance is only allowed to use the information from previously seen utterances, which is accomplished by masking all the future utterances in the dialogue. In a single transformer layer, each target utterance query simultaneously assesses and encodes the information from all the unmasked key utterances, resulting in a contextualized representation of each utterance, again a matrix in R^(N x d).

Output Layers: Finally, a feed-forward network acts on these contextualized representations to output, for each utterance, one score per issue that the model is trying to rank. We then apply the sigmoid operation to constrain each score between 0 and 1, resulting in an output matrix with one row per utterance and one column per issue. In comparison to text ranking tasks, where the set of items being ranked is large and can be dynamic, the set of issues in realistic multi-issue negotiations is usually small and fixed. Hence, we predict the scores for all of these issues together, unlike the text ranking literature, where each item is ranked separately (Yates et al., 2021).

Training: We employ the pairwise margin ranking loss to train our model in an end-to-end manner. The loss L_k after observing k utterances from the partner is defined as:

L_k = Σ_{p=(i1, i2) ∈ P} l(s_i1, s_i2, y_p),    (5.1)

where l is given by:

l(s_i1, s_i2, y) = max(0, −y (s_i1 − s_i2) + m).    (5.2)

P represents the set of all possible pairs of issues. s_i1 and s_i2 are the scores from the final layer of the hierarchical ranker after applying the sigmoid operation. y_p captures the ground-truth ranking between issues i1 and i2: y is equal to +1 when i1 should be ranked higher (has a larger score) than i2, and it is kept as −1 otherwise. m is the margin.

The objective of the ranking loss is to train the model to predict a higher score for the issue that is ranked higher by the ground-truth priority order. A positive margin m ensures a nonzero loss if the score for the higher-ranked item does not exceed its counterpart by at least m, forcing the model to predict well-separated boundaries. We experimented with different values for m, concluding that a nonzero margin is necessary for any meaningful training. For the results presented in this Chapter, we set m to 0.3.

Inference: Once the model is trained, the predicted scores can be used to output the desired ranking order for a given input dialogue. The model simply outputs the ranking of the issues by ordering them in decreasing order of these predicted scores.

Note on the loss formulation: The pairwise ranking loss was chosen for its suitability and simplicity. However, other potential alternatives do exist. Since the number of issues is limited, one can remodel the prediction task as classification over all the possible orderings. However, this formulation fails to capture that although two orderings can both be wrong, one can be somewhat less wrong than the other. Hence, a ranking loss is more suitable for giving a smoother signal to the model during training, leading to a better performance in our initial experiments. We also explored more complicated ranking loss functions and a sequence-to-sequence model to directly generate the sequence of issues in their correct ranking order (Yates et al., 2021). We instead found the pairwise ranking loss to be effective and simple for our approach in this work, which involves a limited set of issues and exploits partially masked loss functions (Section 5.2.2.1). Regardless, we encourage future work to explore these other formulations as well, depending on the task at hand.
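To make the training objective concrete, a minimal sketch of the pairwise loss in Equations 5.1 and 5.2 is shown below, including the partial masking used for the constructed dialogues of Section 5.2.2.1. This is an illustrative implementation rather than the exact code behind the reported results.

import itertools
import torch

def pairwise_margin_loss(scores, gold_rank, known_pair=None, margin=0.3):
    """Pairwise margin ranking loss (Equations 5.1-5.2) for one utterance step.

    scores: 1-D tensor of sigmoid outputs, one score per issue.
    gold_rank: gold_rank[i] is the ground-truth rank of issue i (0 = highest priority).
    known_pair: if given, only this pair contributes to the loss (the partial
    masking used for the constructed CA dialogues); otherwise all pairs are used.
    """
    pairs = [known_pair] if known_pair is not None else \
        list(itertools.combinations(range(len(gold_rank)), 2))
    loss = scores.new_zeros(())
    for i1, i2 in pairs:
        y = 1.0 if gold_rank[i1] < gold_rank[i2] else -1.0
        # max(0, -y * (s_i1 - s_i2) + m)
        loss = loss + torch.clamp(margin - y * (scores[i1] - scores[i2]), min=0.0)
    return loss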
5.2.2 Data Augmentations

The transformer model discussed above learns to rank the issues directly from the partial dialogue as input, without any additional supervision. Although this approach performs reasonably well in our experiments, it ignores the observations made in prior work, which has primarily relied on annotations for preference and offer statements for partner modeling (Nazari et al., 2015). This suggests that more explicit feedback for extracting information from preferences and offers is one avenue for improving performance, especially in settings where the available dialogue data is scarce. Instead of gathering additional annotations, we devise alternate ways to better capture the preferences and offers in our hierarchical ranking model. We achieve this by adapting two additional data sources for this task, allowing the data to be directly added to the primary training dataset and enabling end-to-end parameter sharing between these related tasks.

Datasets: We leverage two datasets in this work: CaSiNo (Chawla et al., 2021b) and DealOrNoDeal (Lewis et al., 2017). As discussed before, CaSiNo is grounded in a camping scenario, containing negotiations over three issues: food, water, and firewood. In addition to the dialogue, the dataset also contains metadata about the arguments used by the negotiators. DealOrNoDeal involves three arbitrarily defined issues: books, hats, and balls. Our main goal is to perform partner modeling for CaSiNo. To this end, we adapt DealOrNoDeal along with the available metadata in CaSiNo for data augmentation. We refer to the CaSiNo dialogues as CD, the CaSiNo argument metadata as CA, and the DealOrNoDeal dialogue data as DND. While the CD data can be used as-is with our model, we adapt the other two data sources (CA and DND) to make them suitable for our approach (see Figure 5.2). We now describe these adaptations.

5.2.2.1 Capturing Preferences

In order to provide more direct supervision for the preferences, we leverage the metadata from CaSiNo (CA data), where the participants explicitly reported their arguments for needing or not needing a specific issue (blue segment in Figure 5.2). For instance, if food is the highest-priority issue for a participant, they were asked to come up with an argument from their personal experiences as to why they would need food the most for camping (see Chapter 3 for more details). Example arguments are provided in Figure 5.2. The participants came up with a variety of such arguments covering Personal Care, Recreational, Group Needs, or Emergency requirements (see Appendix A.2). The participants were then encouraged to leverage these arguments in their upcoming negotiations. This metadata can provide more direct feedback on which implicit preference statements lead to a higher or a lower affinity towards a specific issue. To incorporate this, we create dummy dialogues using templates and add them to the training data for our partner modeling task. Consider a set of arguments A = (H, M, L), containing one argument each for the High-, Medium-, and Low-priority issues, respectively. We extract two of the three possible argument pairs3 and construct a dummy dialogue for each as per Figure 5.2.
We ordered the arguments within each dummy dialogue randomly to avoid any induced biases. For each constructed dialogue, we only have the ground-truth ranking order for a single pair of issues. Hence, the pairwise loss function from Equation 5.1 needs special treatment to ignore the score of the issue that is not relevant to a given dialogue. More specifically, while training with these constructed dialogues, we partially mask the margin ranking loss to only consider the loss from the pair for which the relation is known. Further, since a partial dialogue is not meaningful in this case, we only train the model with the loss L_2, that is, at k=2. Although we use the readily available metadata from CaSiNo in our work, we believe that such contextual data can be constructed for other realistic domains as well, for example, by leveraging appropriate domain-specific knowledge about the negotiators' common requirements.

3 We skip the third pair due to the absence of a visible difference based on our qualitative analysis.

5.2.2.2 Capturing Offers

To better capture the preferences in the previous section, our approach was to construct synthetic dialogues from a resource that primarily focuses on implicit preference statements so as to teach the model in a more explicit manner. With a similar idea, we adapt DND dialogues to better use the offer statements (green segment in Figure 5.2). The DND dataset follows the same multi-issue framework as CaSiNo, which enables our adaptation. Each dialogue in DND involves three arbitrarily defined issues: books, balls, and hats. Due to the arbitrary nature of these issues, there is minimal context discussed in the dialogues, reducing them to essentially an exchange of offers from both sides (see the example in Figure 5.2). Hence, such a resource can be used to provide more explicit supervision to learn from the offer statements of the partner. We adapt these dialogues by randomly mapping the issues in DND to the issues in the target dataset, in our case, CaSiNo. We modify the utterances by replacing all occurrences of the DND issues with the corresponding issues in CaSiNo. For this purpose, we find that simple regular expressions prove to be effective (Appendix C.2.1). Once mapped, this adapted data is simply added to the training data for our partner modeling task.

Note on multi-issue negotiations: Our adaptation described above leverages the structural similarities between the two datasets. If the tasks follow a similar structure, it is relatively straightforward to use adaptations as described above for other settings as well. This can largely be done with regular expressions, but even when it cannot, this relatedness still paves the way for multi-task learning. The negotiations in DealOrNoDeal and CaSiNo are based on a popular abstraction in the negotiation literature, referred to as the Multi-Issue Bargaining Task, or MIBT (Fershtman, 1990). MIBT is a generic framework that can be useful for many negotiation tasks beyond these datasets as well, for instance, salary negotiations or negotiations between art collectors distributing items among each other. It is extensively used in NLP (Lewis et al., 2017; Chawla et al., 2021b; Yamaguchi et al., 2021), beyond NLP (Mell and Gratch, 2017), and in industry as well (e.g., iDecisionGames4).

4 https://idecisiongames.com/promo-home

5.3 Experimental Design

We address the following research questions. RQ1) How useful is the proposed transformer-based ranker along with data augmentations for partner modeling in negotiation dialogues?
We experiment with two pretrained language models and compare our ranker to standard baselines. To test the data augmentations, we analyze model ablations, including zero-shot and few-shot settings. We also observe whether they lead to better performance with a lower number of utterances. RQ2) Do preferences and offers contribute to the performance? To further shed light on the contributions of these utterances to the final partner modeling performance, we look at the average attention scores on these utterances. Further, for a more explicit analysis, we observe whether the performance varies by the integrative potential in the negotiation, which essentially captures how aligned the preferences of the two negotiators are (Chawla et al., 2021b). Scenarios with low integrative potential are usually associated with a higher expression of preferences and offers. Hence, we expected the performance to be higher in the cases with low integrative potential. RQ3) How does our approach compare to a human expert? We compare our model to a human expert and characterize the errors made by the models.

Datasets: Each data point in CD results in two dialogues for our analysis, based on the perspectives of the two negotiators (Section 5.1). We report results on 5-fold cross-validation for this dataset. We further leave out 100 dialogues from the training data for hyperparameter tuning, resulting in 1548 dialogues for training, 100 for tuning, and 412 for evaluation, for each cross-fold. We extract CA from the metadata corresponding to the training data of CD, leaving out 200 constructed dialogues for validation (following Section 5.2.2.1). For the DND data, we only select the dialogues with at least 4 total utterances and unique priority values for meaningful training. After adaptation (following Section 5.2.2.2), we end up with 4074 dialogues for training and 444 for validation. All the models are primarily validated and tested on the corresponding subsets of CD (except for some additional analysis presented in Section 5.4).

Evaluation Metrics: Our metrics are inspired by the negotiation literature, along with related research on Dialog State Tracking (DST) and Learning-to-Rank (LTR) tasks in NLP. Our primary metric is Exact Match Accuracy (EMA): the percentage of cases where the predicted priority order is entirely correct. This is analogous to the popular Joint Goal Accuracy in DST, which captures the cases where all the slots are correctly identified (Balaraman et al., 2021). For negotiation tasks, even knowing the topmost priority can be useful. Hence, we also report Top-1 Accuracy: the percentage of cases where the highest-priority issue is correctly predicted. Finally, we report the Normalized Discounted Cumulative Gain (NDCG@3). NDCG has been widely used in LTR tasks with distinct relevance values (Yates et al., 2021), which is also true for the setting that we consider. In our case, we use relevance values of 5, 4, and 3 for the highest-, second-, and lowest-ranked issues, respectively, following the incentive design of CaSiNo. We compute these metrics for all k from 1 to 5, varying the number of partner utterances seen by the model. We present the results at k=5 to analyze the performance after seeing almost all of the partner utterances in CaSiNo.
To capture the performance with partial dialogues, we report corresponding k-penalty versions that take a weighted average of the performance for different values of k, while giving a linearly higher weight to the performance at a lower k.

Methods: We refer to the complete model from Figure 5.2 that combines all three datasets for training as CD + CA + DND. We compare it with its ablations, including zero-shot and few-shot scenarios. We further develop two standard baselines. The Random baseline chooses the final ranking at random from all the possible orderings. BoW-Ranker is based on the Bag-of-Words paradigm. The input features are based on the normalized frequencies of the 500 most frequent words in the training dataset, excluding stopwords. Instead of contextualized hierarchical representations, this method directly uses a feed-forward network on the input BoW features to predict the ranking. The model is trained on partial dialogues using the same margin ranking loss.

Training Details: The embedding dimension throughout is 768 for transformer-based models. These models use the base variant of either BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) for the Level I encoder. The Level II encoder uses one transformer layer. The feed-forward network contains two fully connected layers with a final sigmoid activation. We train the models with the Adam optimizer using a learning rate of 2e-5 for transformer-based methods and 2e-3 for BoW-Ranker. The margin m is kept at 0.3. We use a dropout of 0.1 to prevent overfitting. We further employ a loss-specific dropout of 0.15 in order to backpropagate the loss from fewer values of k simultaneously. The models were trained for 20 epochs with a batch size of 25. We checkpoint after every epoch, and the checkpoint with the highest EMA at k=5 on the held-out CD dataset is chosen for evaluation. We provide details on the computing infrastructure, hyperparameter tuning, and validation performance in Appendix C.1.

Model | k=5: EMA | k=5: Top-1 | k=5: NDCG@3 | k-penalty: EMA | k-penalty: Top-1 | k-penalty: NDCG@3
Random | 16.46 (1.47) | 32.49 (1.58) | 48.49 (1.16) | 16.59 (1.22) | 33.99 (1.13) | 49.76 (0.75)
BoW-Ranker | 28.49 (1.30) | 53.38 (2.21) | 65.51 (0.62) | 27.71 (1.24) | 52.98 (1.97) | 64.31 (1.67)
BERT-based:
DND | 41.12 (3.06) | 64.69 (2.94) | 73.88 (1.57) | 34.50 (1.12) | 58.75 (1.35) | 68.48 (0.77)
CA+DND | 41.90 (2.93) | 66.98 (3.17) | 75.91 (2.28) | 36.01 (1.25) | 61.09 (1.90) | 70.09 (1.49)
CD | 53.97 (3.02) | 77.70 (2.85) | 83.75 (1.96) | 42.30 (1.53) | 66.80 (1.78) | 74.39 (1.45)
CD+CA | 57.24 (3.09) | 79.74 (2.37) | 84.99 (1.87) | 44.39 (1.17) | 67.88 (1.16) | 75.31 (1.10)
CD+DND | 56.12 (4.07) | 79.16 (2.57) | 84.66 (1.84) | 43.79 (2.07) | 68.18 (1.55) | 75.38 (1.60)
CD+CA+DND | 56.56 (2.07) | 80.13 (1.07) | 85.49 (1.09) | 44.22 (1.82) | 69.21 (2.05) | 76.03 (1.60)
RoBERTa-based:
DND | 45.21 (3.07) | 68.10 (2.80) | 77.01 (1.76) | 37.66 (1.41) | 61.41 (2.30) | 70.44 (1.50)
CA+DND | 46.76 (1.89) | 68.73 (1.22) | 77.65 (0.90) | 39.43 (1.67) | 62.87 (2.50) | 71.70 (1.83)
CD | 60.06 (3.01) | 81.98 (1.75) | 86.54 (1.31) | 46.57 (1.60) | 69.26 (1.69) | 76.17 (1.22)
CD+CA | 60.01 (2.23) | 80.23 (2.11) | 85.85 (1.41) | 46.96 (2.10) | 68.59 (1.93) | 76.05 (1.14)
CD+DND | 62.54 (3.30) | 82.56 (1.24) | 87.57 (1.18) | 47.69 (2.52) | 69.98 (1.96) | 76.71 (1.55)
CD+CA+DND | 63.57 (3.44) | 82.76 (2.47) | 87.55 (1.58) | 48.72 (2.03) | 70.03 (1.63) | 77.14 (1.38)

Table 5.1: Performance on the partner modeling task, showing the utility of the proposed methods. EMA and Top-1 represent the accuracy in percentage. We also scaled NDCG@3 to 0-100. For all the metrics, higher is better. The numbers represent Mean (Std.) over the 5 cross-folds of the CD data.
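For concreteness, the three metrics reported in Table 5.1 can be computed per example as in the following sketch, using the relevance values 5, 4, and 3 described above. This is an illustrative implementation; the exact evaluation scripts may differ.

import math

def ema(pred, gold):
    """Exact Match Accuracy for one example: 1 if the full predicted order matches."""
    return float(pred == gold)

def top1(pred, gold):
    """1 if the highest-priority issue is identified correctly."""
    return float(pred[0] == gold[0])

def ndcg_at_3(pred, gold):
    """NDCG@3 with relevance 5/4/3 for the gold highest/second/lowest issue."""
    relevance = {issue: rel for issue, rel in zip(gold, (5, 4, 3))}
    dcg = sum(relevance[issue] / math.log2(rank + 2)
              for rank, issue in enumerate(pred[:3]))
    ideal = sum(rel / math.log2(rank + 2) for rank, rel in enumerate((5, 4, 3)))
    return dcg / ideal

# Example: predicted vs. ground-truth priority order for a CaSiNo dialogue.
pred = ["food", "firewood", "water"]
gold = ["food", "water", "firewood"]
print(ema(pred, gold), top1(pred, gold), round(ndcg_at_3(pred, gold), 3))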
5.4 Results and Discussion

5.4.1 RQ1: Are Data Augmentations Useful?

We summarize the results in Table 5.1. Our proposed ranking-based models beat the Random and BoW-Ranker baselines by a huge margin across all metrics. This is true even for zero-shot DND and for CA + DND, attesting to the utility of the proposed ranking methodology and data adaptations.5 Comparing similar configurations, we observe that RoBERTa-based models outperform BERT-based models on this task. The best-performing configuration is the RoBERTa CD + CA + DND model that combines all three data sources.

5 Training with the CA data only was not useful due to the lack of training with any partial dialogues.

Figure 5.3: Mean performance for two RoBERTa-based models, CD and CD + CA + DND: (a) on different percentages of CD data, where the Y-axis represents EMA at k=5, and (b) for different values of k.

In Figure 5.3a, we plot the performance for different percentages of CD data. We only show RoBERTa-based models due to their superior performance. The plot highlights the advantage of adapting the related data sources, especially in few-shot settings, with CD + CA + DND at 50% of the data matching the performance of CD at 100%. We also look at how the performance varies with the number of utterances seen in Figure 5.3b. We find that the performance gains are visible across all values of k. The data augmentations allow the model to perform better than the baselines while observing a fewer number of utterances, making the model more useful in realistic scenarios.

(a) CA
Model | Accuracy
Random | 52.4 (4.14)
AD | 63.8 (9.33)
AD+DND | 73.4 (6.19)
CD+AD | 78.9 (1.39)
CD+AD+DND | 76.7 (3.52)

(b) DND
Model | EMA
Random | 16.04 (0.92)
DND | 60.68 (2.05)
AD+DND | 60.9 (1.87)
CD+DND | 63.11 (1.77)
CD+AD+DND | 63.56 (0.94)

Table 5.2: Performance for RoBERTa-based models: (a) argument classification accuracy on the validation set of CA, (b) EMA at k=2 for partner modeling on the validation set of DND. The numbers represent Mean (Std.) over the 5 cross-folds.

Performance on the adapted datasets: We analyze whether our joint learning also improves the performance on the validation sets of the CA and DND datasets, showing advantages across multiple tasks. For the CA dataset, we measure argument ranking accuracy: for a given input dialogue based on a pair of arguments, we consider a prediction as correct if the scores predicted by the model correctly rank the arguments. For DND, we analyze EMA at k=2 for partner modeling, similar to our setup for CaSiNo. As evident from Tables 5.2a and 5.2b, we find support that joint learning improves the performance on the CA and DND datasets as well.

5.4.2 RQ2: Do Preferences and Offers Help?

Average attention: We identify the utterances with preference statements by utilizing the strategy annotations in CaSiNo (Chawla et al., 2021b). We assume that an utterance contains a preference if it was annotated with at least one of the Self-Need, Other-Need, or No-Need strategies. For identifying offers, we use regular expressions following prior work (He et al., 2018) (refer to Appendix C.2.2). We consider any utterance that is not labeled with a preference or an offer as Other. We then observed the average attention placed by the best-performing model on these categories in the Level II encoder.
Preferences received an average attention score of 0.3, offers received 0.27, and other utterances received 0.08, without any explicit indication about these categories during model training. We consider this as preliminary evidence that the learning process matches our intuition, with preferences and offers contributing to the performance.

Performance across integrative potential: For more concrete evidence of the utility of preferences and offers, we look at how the performance varies between scenarios with low and high integrative potential. This essentially captures how aligned the preferences of the two negotiators are in a negotiation. In a scenario with low integrative potential, the negotiations are more competitive, leading to a higher expression of preferences and offers and providing a better signal to our ranking models. For our best-performing model, we find EMA at k=5 to be 68.75 (4.58) for scenarios with low integrative potential against 60.31 (2.67) for those with high potential. This provides stronger evidence that the learning process sensibly takes into account the preference and offer statements in the data.

5.4.3 RQ3: Comparison to a Human Expert

Similar to the trained models, we asked a human expert (an author of this work) to guess the priority order of the partner from partial dialogues. The expert was allowed to make multiple guesses when unsure, in which case the final ranking was chosen randomly from the guesses. We compare the expert to our best-performing model on 100 dialogues from the evaluation set. The expert achieved 75% mean EMA at k=5 against 66% for the model, while also performing better on the other metrics. We show the comparison for varying values of k in Appendix C.3.

While the model performs reasonably, there is scope for improvement. We performed a qualitative analysis of the errors made by the model and the expert. In many cases, it is simply not feasible to predict accurately, especially when negotiators engage in small talk early on, indicating a limited scope for improvement with fewer utterances. In some cases, there is more focus on the highest-priority issue, giving less explicit signals about the entire ranking. This might work for some applications, but in other cases, the agent design can be modified to discuss the complete ranking more explicitly. Integrating other datasets that follow the same MIBT structure (DeVault et al., 2015) via data adaptation or multi-task learning is another potential direction. We also observed errors in cases that included longer, contextually dense utterances, where preferences are shared indirectly as a response to the partner, and when negotiators give away their higher-priority issues out of empathy towards their partner. These cases are easier for the expert but can be confusing to the model. Better modeling of the prior context and handling of longer utterances are also avenues for improvement in the future.

5.5 Conclusion

We presented and evaluated a transformer-based approach for partner modeling in mixed-motive negotiation dialogues. Our objective was to address the challenges in bridging the gap between existing research and practical applications of partner modeling techniques. Our comparison to baselines and ablations attests to the utility of our method. We found that the proposed data adaptations can be especially beneficial in zero-shot and few-shot scenarios.

In the next Chapter, we design dialogue models for mixed-motive interactions.
Our investigation in these past few Chapters shows that various complex behaviors are prevalent in realistic mixed-motive interactions. Successful participants attempt to persuade their partners with personal motivations, employ a range of linguistic strategies, and use appropriate emotion expression. At the same time, they must also interpret these behaviors in their partner's utterances, understand their partner's preferences, and take them into account for future interactions. This makes designing automated systems that effectively take part in scenarios like CaSiNo a challenging undertaking. Hence, we start our exploration by focusing on DealOrNoDeal, a simpler scenario that still captures the fundamental characteristics of mixed-motive interactions, demanding a balance between conflict and mutual dependence. Achieving this balance is the objective of our exploration, as we build on the prior work on this dataset. Later (in Chapter 7), we will explore ways in which Large Language Models (LLMs) can help us extend our models to other settings.

Chapter 6
Be Selfish, But Wisely: Designing Mixed-Motive Dialogue Models

"Firms, in the pursuit of profits, are led, as if by an invisible hand, to do what is best for the world."
– Adam Smith: The Father of Modern Economics

Prior work has explored self-play Reinforcement Learning (RL) to design dialogue systems for negotiations (Lewis et al., 2017; He et al., 2018). The typical pipeline consists of the following three steps. Step I: Train a model that imitates human-human dialogue data in a supervised manner. Step II: Create two copies of this supervised model: one that serves as the initialization for the RL agent and another that acts as a fixed simulated user. Step III: Update the RL agent to maximize its performance using an online RL algorithm by making it interact with the simulated user (bot-bot interactions) and recording the final performance achieved by the model (the reward).

Although used in prior work, we argue that this procedure leads to a fundamentally flawed system that fails to learn the value of compromise in a negotiation. Arguments: 1) The available human-human negotiation data mainly contains dialogues that end in agreements (≈ 80% in the DealOrNoDeal dataset (Lewis et al., 2017)), instead of walkaways or no agreements, leading to a highly prosocial simulated user that tends to show agreement regardless of how favorable the deal is. Hence, when training the RL agent to maximize its own performance against this simulated user, the agent becomes highly self-interested without learning to make any concessions, since that behavior still leads to a high reward. We show one such problematic conversation between these two models in Table 6.1. 2) Another piece of evidence comes from prior work (Lewis et al., 2017). Even though such an RL model seems to perform well in automated evaluations (against the simulated user), it performs much worse against human partners, who often prefer to walk away with no agreement and 0 points earned for both parties rather than agreeing to an uncompromising partner. 3) Finally, one can look at what happens if the RL agent is made to play with another copy of itself. In this case, we find that the agents simply get stuck, both continuously asking for what they want without looking for a compromise (refer to Appendix D.1 for a sample conversation).
Context (Alice: RL-based, Bob: Supervised)
Counts: Book = 2, Hat = 1, Ball = 3
Alice Values: Book = 1, Hat = 2, Ball = 2
Bob Values: Book = 0, Hat = 7, Ball = 1

Dialogue
Alice: i would like the balls and hat and a book
Bob: you can have the balls and one book
Alice: i will take the balls and hat
Bob: deal
Alice: <dealselection>

Output
Alice: Book = 0, Hat = 1, Ball = 3
Bob: Book = 2, Hat = 0, Ball = 0

Reward
Alice: 8/10
Bob: 0/10

Table 6.1: A sample problematic negotiation dialogue between the standard RL agent (Alice) and a supervised model (Bob), based on Lewis et al. (2017). The task here is to divide the available books, hats, and balls between the two players. In this case, Bob accepts a deal even though it is very unfavorable, resulting in a high score for Alice.

This failure hurts the practical utility of the system, both from the perspective of being a successful negotiator in conversational AI use cases and for providing social skills training in pedagogy. The key challenge here is to somehow teach the model to be a mixed-motive negotiator instead of only self-interested, with a better understanding of the concept of walkaways in a negotiation, even though the collected dialogue data primarily consists of dialogues ending in agreements. To address this, we investigate two modifications to the training procedure (Section 6.1), resulting in systems that exhibit diverse personalities:1 1) We vary the RL reward directly so that the model is forced to take the partner's interests into account. This corresponds to manipulating the motives of the dialogue agent, a psychological concept that has received significant attention in the literature (Murphy and Ackermann, 2014). For this purpose, we rely on a measure of utility from negotiation theory in Economics (Fehr and Schmidt, 1999), which helps us to control selfish vs. fair behavior explicitly. 2) We vary the personality of the simulated user that the RL agent is trained with. This approach essentially manipulates the interaction experience that the agent receives so that the agent is itself allowed to discover the value of making concessions by being better exposed to walkaways during training. Through a comprehensive automated and human evaluation, we investigate what model variation allows for superior performance. Our key finding is that a selfish agent, which maximizes its own performance while also avoiding walkaways, achieves superior performance to other variants by learning to generate value for both itself and the negotiation partner (Section 6.3).

1 By personality, we simply refer to the consistent behavior portrayed by the trained agent (https://www.apa.org/topics/personality).

6.1 Methodology

Our focus is on negotiations based on MIBT (introduced in Chapter 2). Recall that in MIBT, each issue has a predefined quantity along with a random value (potentially different) assigned for every player. The players engage in a dialogue to reach an agreement: a possible division of all the available items in which they try to maximize the total value of the items that they get. Our goal here is to develop techniques so that the trained dialogue models learn to make concessions for their partners (e.g., by offering deals that help the partner) apart from just learning to extract concessions from them. As discussed earlier, this mixed-motive behavior is a fundamental expectation from a practical negotiation dialogue system.
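For reference, the points a player earns under MIBT are just the value-weighted count of the items they receive in the agreed deal, with 0 for both players on a walkaway. A minimal sketch, with illustrative names:

def mibt_points(item_values, allocation):
    """Points earned by one player: the value-weighted count of the items they
    receive in the agreed deal. A walkaway (allocation=None) yields 0 points."""
    if allocation is None:
        return 0
    return sum(item_values[item] * count for item, count in allocation.items())

# Example based on Table 6.1: Alice values Book=1, Hat=2, Ball=2 and receives
# 0 books, 1 hat, and 3 balls, scoring 8 out of a maximum of 10.
alice_points = mibt_points({"book": 1, "hat": 2, "ball": 2},
                           {"book": 0, "hat": 1, "ball": 3})
assert alice_points == 8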
To achieve this, we propose two complementary techniques: first, where we explicitly incorporate the partner's performance into the reward function of the RL agent, and second, where the model implicitly learns to make concessions by interacting with a specific partner during training. We start by describing our base RL framework and then discuss the two proposed techniques.

6.1.1 Self-Play RL for Dialogue

We use the self-play RL framework introduced by Lewis et al. (2017) for training negotiation dialogue systems. Their pipeline consists of first training a supervised agent to mimic the collected human-human dialogue data and then using self-play RL to further optimize the model. As Lewis et al. (2017) note, training a supervised agent to mimic human actions is a scalable and domain-agnostic starting point. However, this model by itself is unable to engage in strategic actions necessary for effective negotiation. By then having the supervised model negotiate with a fixed copy of itself (the simulated user) and fine-tuning the model using an online RL algorithm, the model can be optimized towards a given reward function (in this case, the points scored by the agent in the negotiation).

The framework relies on a sequence-to-sequence model based on an ensemble of Gated Recurrent Units or GRUs (Cho et al., 2014). The model consists of one unidirectional GRU for encoding the input goals of the agent, another to encode the utterances from both the agent and the human partner, and one bidirectional GRU to generate the output deal once the negotiation is over.2 In the supervised stage, the model is trained on a combined cross-entropy loss that jointly optimizes both the next-token prediction and the output deal prediction. The RL agent is trained with the REINFORCE method (Williams, 1992).

2 Although the exact choice of the model architecture is irrelevant to our analysis, we choose this lightweight architecture to enable our analysis with different kinds of agent personalities.

6.1.2 Proposed Techniques

6.1.2.1 Varying the Reward Function

The key idea here is to incorporate the partner's performance into the reward function used for training the RL agent. Intuitively, this would make the agent more prone to offering deals or accepting deals that help the partner as well. To approach this systematically, we leverage a measure of utility defined in negotiation theory in Economics by Fehr and Schmidt (1999). The utility function U_i(x) is defined as follows:

U_i(x) = x_i − a * max(0, x_j − x_i) − b * max(0, x_i − x_j),    (6.1)

where b ≤ a and 0 ≤ b < 1. i and j denote the two players in the negotiation. x = (x_i, x_j) denotes the points scored by the corresponding players. U_i(x) essentially captures the utility gained by player i from the negotiation, given the points scored by all the players (x).

Fehr and Schmidt (1999) defined this utility measure to model diverse behaviors in human-human negotiations, noting that merely assuming that all players are selfish does not explain the data. Hence, to capture the diversity in human behaviors, the equation includes additional terms that capture the advantage and the disadvantage of player i with respect to player j in the negotiation. We repurpose this utility measure directly as the reward for the RL agent. By varying the coefficients a and b, different reward functions that promote diverse personality behaviors can be generated. We demonstrate this in Table 6.2. For our analysis in this work, we choose the selfish and fair configurations.

a | b | Utility U_i(x) | Interpretation
0 | 0 | x_i | Selfish: partner points don't matter.
1 | 0 | x_i − max(0, x_j − x_i) | Doesn't like it if the partner outperforms.
0 | -1 | x_i + max(0, x_i − x_j) | Selfish and Envious (desires poor partner performance).
0.75 | 0.75 | x_i − 0.75 * max(0, x_j − x_i) − 0.75 * max(0, x_i − x_j) | Fair: Doesn't like it if the partner performs worse or better.

Table 6.2: Demonstration of the reflected personalities obtained by varying the parameters a and b from Equation 6.1. The variants used in this work are the Selfish and Fair configurations (highlighted in blue in the original table).
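Concretely, the reward used for RL training under a given (a, b) configuration can be computed as in the following sketch, a minimal illustration of Equation 6.1 rather than the exact training code:

def fehr_schmidt_reward(agent_points, partner_points, a=0.0, b=0.0):
    """Equation 6.1: the agent's utility given both players' points.
    (a, b) = (0, 0) gives the selfish reward; (0.75, 0.75) gives the fair one."""
    disadvantage = max(0.0, partner_points - agent_points)
    advantage = max(0.0, agent_points - partner_points)
    return agent_points - a * disadvantage - b * advantage

# Selfish vs. fair reward for the outcome in Table 6.1 (Alice 8, Bob 0).
print(fehr_schmidt_reward(8, 0))                   # 8.0 (selfish)
print(fehr_schmidt_reward(8, 0, a=0.75, b=0.75))   # 8 - 0.75 * 8 = 2.0 (fair)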
6.1.2.2 Varying the RL Partner

While the above method, in some ways, explicitly pushes the agent to take the partner's performance into account, we now propose another technique to achieve this more implicitly. Since the supervised model tends to show prosocial behaviors (Table 6.1), the RL agent fails to explore scenarios that do not lead to an agreement and, hence, cannot capture the notion of walkaways in the learned policy. However, if the agent were to interact with an uncompromising partner, this could be leveraged to simulate "walkaways" during model training, with the hope that the model discovers ways to avoid disagreements (while still optimizing the reward) and thus implicitly learns about making concessions for the partner. Hence, the key idea here is to vary the personality of the partner model. In addition, we define a length cut-off: if the conversation reaches the cut-off number of utterances, this is seen as a disagreement, and both agents receive 0 points from the negotiation. We explain how we design the diverse partner personalities for training later in Section 6.2.

6.2 Experimental Design

We proposed two ways of training dialogue models that capture the mixed-motive nature of negotiations: 1) explicitly, by varying the reward function for the RL algorithm (Section 6.1.2.1), and 2) implicitly, by varying the partner with which the RL model is trained (Section 6.1.2.2). The primary research question we aim to answer is which variation leads to superior performance with human partners. We first describe the dataset and the study design, followed by the results in Section 6.3.

Dataset: We use the DealOrNoDeal dataset (Lewis et al., 2017), which is based on the Multi-Issue Bargaining Task (Fershtman, 1990) design. The dataset uses a simplistic design involving 3 issues (books, hats, and balls) and has been a popular choice for research in negotiation dialogue systems. It comprises 5808 dialogues in English based on 2236 unique scenarios, where a scenario refers to the available items up for grabs and their corresponding values for the two players. In each scenario, there is a fixed quantity of each issue, and players are randomly assigned a point value for each of the 3 issues before the negotiation. The goal of the dialogue is to reach an agreement on a possible division of all the available items, where each player strives to maximize the total value of the items that they get. The maximum possible value for a player is 10. However, if no agreement is reached, then both players end up with 0 points. Nearly 80% of the dialogues end in agreement, with an average of 6.6 turns per dialogue and 7.6 words per turn. We use the same splits as the original dataset paper to train our dialogue agents.

Figure 6.1: The three-stage process used to design the 6 dialogue agents for our 2 x 3 study. r: the reward that the RL agent is trained to maximize. p: the partner with which the RL agent is trained. p=S corresponds to the supervised model trained in Stage 1, while p=selfish and p=fair correspond to the respective models trained in Stage 2.
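Putting the preceding pieces together, one training episode under this setup can be sketched schematically as follows, reusing mibt_points and fehr_schmidt_reward from the earlier sketches. The policy and deal-extraction callables are hypothetical stand-ins for the GRU-based generator and deal-output head of the end-to-end negotiator; the actual rollout logic in our implementation differs in its details.

MAX_UTTERANCES = 20  # reaching this cut-off is treated as a walkaway

def selfplay_episode(agent_policy, partner_policy, agent_values, partner_values,
                     extract_deal, a=0.0, b=0.0):
    """Roll out one negotiation between two utterance-level policies and return
    the agent's reward. Each policy maps the dialogue so far to its next
    utterance; extract_deal maps the finished dialogue to the two allocations."""
    dialogue, policies = [], (agent_policy, partner_policy)
    while len(dialogue) < MAX_UTTERANCES:
        utterance = policies[len(dialogue) % 2](dialogue)
        dialogue.append(utterance)
        if utterance == "<dealselection>":  # negotiation over; read out the deal
            agent_alloc, partner_alloc = extract_deal(dialogue)
            x_agent = mibt_points(agent_values, agent_alloc)
            x_partner = mibt_points(partner_values, partner_alloc)
            return fehr_schmidt_reward(x_agent, x_partner, a, b)
    return 0.0  # cut-off reached: disagreement, 0 points for both players

The returned reward is what REINFORCE maximizes in Stages 2 and 3 below: varying a and b changes the agent's motive, while passing a different partner_policy changes the interaction experience it is trained on.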
Study Design: We design a 2 x 3 study based on the strategies described in Section 6.1. We use a three-stage process to develop the 6 agent personalities. Stage 1: Develop a supervised likelihood model, following Lewis et al. (2017). Stage 2: Train two RL dialogue agents by varying the reward r using the selfish and fair utility functions selected from Table 6.2. Note that the selfish configuration here is equivalent to the base RL model trained by Lewis et al. (2017). Stage 3: Train the remaining four RL agents by varying the reward function (selfish vs. fair) and using either of the two models trained in Stage 2 as partners. We provide an overview of this process and describe our notation in Figure 6.1.3

3 Our implementation is based on https://github.com/facebookresearch/end-to-end-negotiator.

Hyperparameters: We borrowed the hyperparameters from Lewis et al. (2017) and refer the readers to that paper for full details. The supervised model is trained for 30 epochs with a batch size of 16 using stochastic gradient descent. The initial learning rate is kept at 1.0, clipping gradients with an L2 norm exceeding 0.5. This was followed by annealing of the learning rate by a factor of 5 per epoch. All the dialogue agents used in the experiments are initialized from this supervised model and trained for nearly 16k agent-agent interactions with the partner model, using a learning rate of 0.1 and a discount factor of 0.95. We use a length cut-off of 20 utterances to simulate walkaways: if a dialogue reaches 20 utterances, this is seen as a disagreement, and both players end up with 0 points.

Human Evaluation: We performed a human evaluation on the Prolific4 crowdsourcing platform. We collected nearly 100 agent-human conversations for each of the 6 dialogue models, where one human worker was allowed to participate only once. The workers were paid a base payment for their time, along with a lottery-based bonus that was dependent on their performance and effort. We provide more details in Appendix D.2, including statistics, worker qualifications, payments, and the design of the user interface.

4 https://www.prolific.co/

6.3 Results

Table 6.3 summarizes the human evaluation results. We analyze 3 key metrics: the points scored by the human, the points scored by the agent, and the total joint points, an indicator of the total value created in the negotiation. We also report the %age of walkaways (%age of dialogues that do not reach an agreement). We discuss the significant trends below.

Model | Points incl. walkaways: Human ↑ | Agent ↑ | Joint ↑ | Points excl. walkaways: Human ↑ | Agent ↑ | Joint ↑ | Walkaways (in %) ↓
M(p=S, r=fair) | 5.72 (0.29) | 5.99 (0.29) | 11.71 (0.43) | 6.03 (0.28) | 6.32 (0.26) | 12.35 (0.34) | 5.15
M(p=fair, r=fair) | 5.87 (0.29) | 6.04 (0.28) | 11.91 (0.43) | 6.24 (0.26) | 6.43 (0.25) | 12.67 (0.33) | 6.00
M(p=selfish, r=fair) | 5.59 (0.31) | 5.80 (0.32) | 11.39 (0.42) | 5.89 (0.30) | 6.12 (0.30) | 12.01 (0.34) | 5.15
M(p=S, r=selfish) | 4.70 (0.32) | 5.58 (0.39) | 10.28 (0.61) | 5.86 (0.27) | 6.96 (0.33) | 12.82 (0.38) | 19.79
M(p=fair, r=selfish) | 4.59 (0.35) | 5.20 (0.42) | 9.79 (0.67) | 6.07 (0.29) | 6.88 (0.37) | 12.96 (0.41) | 24.44
M(p=selfish, r=selfish) | 6.18 (0.30) | 5.90 (0.28) | 12.09 (0.48) | 6.85 (0.25) | 6.54 (0.23) | 13.39 (0.31) | 9.71

Table 6.3: Results from the human evaluation study. We report the Mean (Standard Error) wherever applicable. The Joint points are computed as the mean over the sum of the points scored by both players, an indicator of the joint value created in the negotiation. The maximum possible points for a player in a negotiation is 10.
↑: higher is better, ↓: lower is better. In each column, we highlight the worst and the best scores in red and blue, respectively. We discuss the significant trends in Sections 6.3 and 6.4.

To analyze the overall performance, we conducted 2 (reward r: selfish vs. fair) x 3 (partner p: supervised vs. selfish vs. fair) ANOVAs on the points earned in the negotiation. First, we found no significant differences in the points earned by the dialogue agents. However, the agent reward significantly affected the human points (F(1, 577) = 5.00, p = .03), such that human partners playing with fair agents (r=fair) earned more points (M = 5.73; SE = 0.18) than those playing with selfish ones (M = 5.16; SE = 0.18). There was also a main effect of the partner p (F(2, 577) = 3.09, p = .046), but both of these main effects were qualified by a significant interaction (F(2, 577) = 5.40, p = .005). Consequently, this led to similar significant trends in the joint points earned (F(1, 577) = 5.21, p = .02), such that fair agents (r=fair) earned more joint points with their partner (M = 11.67; SE = 0.29) than selfish ones (M = 10.72; SE = 0.29).

Interestingly, human partners earned more points with the M(p=selfish, r=selfish) agent compared to the other selfish agents, which also led to more joint points, bringing it on par with (or even better than) the fair agents. A plausible explanation is that since the M(p=selfish, r=selfish) agent is trained with an uncompromising partner (unlike the other agents with r=selfish), it is better exposed to the repercussions of not making concessions for the partner, since the agent receives a 0 reward if there is no agreement (within 20 utterances). Thus, the agent learns to "give in" in order to avoid no agreements. Next, we test this explicitly by analyzing the %age of walkaways for each agent.

%age of walkaways: Indeed, a log-linear analysis reveals a significant interaction between the agent's reward and partner on the %age of walkaways (χ2 = 31.1, p < .0001). Specifically, the effect of the partner only appears among the selfish agents (r=selfish), as revealed by a Chi-squared test (χ2(2) = 7.63, p = .02), where only 9.7% of conversations with the M(p=selfish, r=selfish) agent end without an agreement, much lower than with the other selfish agents. This effect was not significant among the agents trained with a fair reward, in which case the %age of walkaways is observed to be low, regardless of the partner p.

Removing walkaways: Once the instances that end in walkaways are removed, we find that selfish agents (r=selfish) earn more points for themselves (M = 6.79; SE = 0.17) than fair agents (M = 6.28; SE = 0.16; F(1, 510) = 4.62, p = .03). This means that the lack of significant effects above in agent points was due to walkaway instances that result in 0 points for the agent. Further, we note that even when walkaways are removed, the human partners earn more points with the M(p=selfish, r=selfish) agent than with the other selfish agents. We observed similar trends for joint points as well, with maximum joint points for the M(p=selfish, r=selfish) agent. This suggests that besides contributing to fewer walkaways, the M(p=selfish, r=selfish) agent further learns to discover creative solutions that help both players.
Context
Counts: Book = 1, Hat = 3, Ball = 1
Model Values: Book = 2, Hat = 1, Ball = 5
Human Values: Book = 10, Hat = 0, Ball = 0

Dialogue
Human: I would like the books
Model: can i have the ball and the book and you can have the hats
Human: you can have the ball and let's split the books
Model: how about i get the ball and two hats and you get the rest ?
Human: i am ok with that
Model: <dealselection>

Output
Model: Book = 0, Hat = 2, Ball = 1
Human: Book = 1, Hat = 1, Ball = 0

Reward
Model: 7/10
Human: 10/10

Table 6.4: Example conversation between the M(p=selfish, r=selfish) agent and a human partner in our experimental study. The agent helps to find a solution that leads to high performance for both players.

We show one such example in Table 6.4 and provide more examples from the human evaluation in Appendix D.3.

6.4 Discussion

Going beyond the typical reward formulations used in the literature, this is the first instance of leveraging prior Economics theories to explicitly incorporate the partner's performance within the reward of a self-play RL negotiation agent. Our formulation provides a systematic and general way to train mixed-motive agents with diverse personalities (Table 6.2). As shown in Figure 6.1, our multi-stage training process also provides an automated way to simulate diverse partner behaviors, instead of the unscalable rule-based approaches followed in prior work (for instance, the price-based rules defined for buyer-seller negotiations in Yang et al. (2021)).

The overall points scored in Table 6.3 show that all fair agents (r=fair) and the M(p=selfish, r=selfish) agent perform better than the M(p=S, r=selfish) agent, which is trained following the standard procedure used in prior work, in terms of the human points, agent points, and (consequently) the joint points. This suggests that both strategies, varying the reward and varying the partner during RL training, show promise for teaching the mixed-motive nature of negotiations to the dialogue agents. We especially note the superior performance of the M(p=selfish, r=selfish) agent. Trained with a simplistic reward that maximizes its own performance, M(p=selfish, r=selfish) learns to make concessions implicitly by being better exposed to the repercussions of not doing so during training. This observation aligns with the philosophy of the 'Invisible Hand' in Economics by Adam Smith (Grampp, 2000), which suggests that self-interested players are implicitly led (as if by an invisible hand) to cooperate and take actions that benefit others.
[Figure 6.2 consists of two 6 x 6 heatmaps over the agent variants: Mean Alice Points (Overall) and Mean Alice Points (Excluding disagreements).]

Figure 6.2: Heatmaps depicting the results from 388 agent-agent interactions. Each cell denotes the points scored (out of 10) by the Alice variant (X-axis) when it interacts with the corresponding Bob model (Y-axis).

[Figure 6.3 consists of the corresponding heatmaps: Mean Joint Points (Overall) and Mean Joint Points (Excluding disagreements).]

Figure 6.3: Heatmaps depicting the results from 388 agent-agent interactions. Each cell denotes the mean joint points scored by the corresponding Alice model variant (X-axis) and the Bob variant (Y-axis).

6.4.1 Automated Evaluation

To gain additional insights into the behavioral diversity and the performance of the dialogue agents, we analyze the results from the agent-agent interactions. For this purpose, we gather 388 conversations for every pair of agents and observe the average points scored by both agents separately and jointly. We depict the agent performance using heatmaps in Figure 6.2. Self-interested agents that are less exposed to walkaways during training (M(p=S, r=selfish) and M(p=fair, r=selfish)) tend to exploit the agents trained with a fair reward. However, this behavior backfires when the partner model behaves similarly in a self-interested manner: both agents show uncompromising behavior that leads to higher disagreements (stuck in negotiation for >= 20 utterances) and, ultimately, extremely low overall scores.
In general, we find the M(p=selfish, r=selfish) agent to be superior, consistently achieving a high performance for itself (the last column of the heatmaps) while also enabling a high performance for its partners (the last row). This trend is also evident from the corresponding heatmaps for joint points shown in Figure 6.3.

[Figure 6.4 consists of bar plots of Satisfaction and Likeness, grouped by the agent's reward (fair vs. selfish) and the RL partner (fair, selfish, supervised).]

Figure 6.4: Subjective assessment by humans. Both metrics are measured on a scale of 1 to 5.

6.4.2 Subjective Assessment

Prior work has argued for the importance of incorporating subjective measures (Aydoğan et al., 2020). Although this is more relevant for repeated interactions between the same players (unlike our case, which only involves one negotiation between an agent and a human partner), we nevertheless present results on the subjective assessment by the human partners for completeness. Through a post-survey, we measured the human partners' satisfaction with the outcome and their likeness towards the agent on a five-point scale (more details in Appendix D.2). We summarize the results in Figure 6.4. Based on 2 x 3 ANOVAs, we find that human partners of the fair agents (r=fair) were significantly more satisfied (F(1, 576) = 47.32, p < .0001) than the humans who interacted with the selfish ones, but this was qualified by a marginally significant interaction with the partner p (F(2, 576) = 2.54, p = .08). This can be attributed to the previously noted observation that human partners, on average, secured more points with fair agents. We find similar trends for likeness towards the agent as well: human partners report higher likeness when playing with fair agents as compared to selfish ones (F(1, 577) = 53.95, p < .0001).

Interestingly, among the selfish agents (r=selfish), the M(p=selfish, r=selfish) agent achieved the highest subjective assessment from the human partners, bringing it close to the performance of the fair agents, even though it was trained with a selfish reward.

6.4.3 Measuring Success

As discussed in prior work (Chawla et al., 2023b), our analysis reflects the multi-faceted nature of the notion of success in negotiations, where observing a single dimension can be misleading. For example, when interacting with the supervised model S, the M(p=S, r=selfish) agent seems to get high points for itself. However, our analysis shows that this is simply due to fewer walkaways, which occur far more often with other selfish agents or human partners. Thus, we stress the importance of a comprehensive evaluation of negotiation dialogue systems. Perhaps the downstream application context can guide which metrics should be prioritized. From a pedagogical perspective, training agents that accurately reflect the diversity in human behavior (as in this work, based on Equation 6.1) can itself be highly valuable for social skills training. Similarly, subjective assessment of the dialogue agents can be more important in scenarios involving relationships for long-term or repeated social influence interactions. If the goal is to design a dialogue agent that performs the best for itself (regardless of partner performance), such as in a game context, perhaps the best strategy is to train it with a variety of partner personalities. The agent must develop a theory-of-mind about the partner and learn to weigh extracting concessions vs.
making concessions based on the personality of the specific partner in the interaction. We attempted to train such an agent, but unfortunately, not keeping the partner model fixed makes the training process unstable (also observed in Lewis et al. (2017)). One explanation for this is the relatively short conversations in DealOrNoDeal, which makes it hard to infer the partner's personality implicitly. Hence, there is value in extending our analysis to other negotiation dialogue datasets (Yamaguchi et al., 2021; Chawla et al., 2021b).

6.5 Conclusion

We devised two variations of the standard self-play RL technique to inculcate the mixed-motive nature of negotiation into the dialogue agents. The first approach worked by varying the reward function, thereby explicitly pushing the model to take the partner's performance into account. In the second approach, we modified the personality of the partner agent during training, which allowed the RL agent to discover the mixed-motive nature of the task implicitly. We find that both techniques hold promise, with an especially strong performance from the agent that is trained with a selfish reward and a self-interested partner. This agent not only improves the agreement rate but also learns to discover offers that create value for its partner without significantly hurting its own points.

The RL models trained in this work function directly at the level of individual utterances (inspired by prior work): one action for the RL agent is to generate the entire next utterance, and one episode is the complete dialogue. Once the negotiation ends, the RL agent receives the reward, which is used for training in the self-play RL stage. Unfortunately, it is non-trivial to use this approach as-is for scenarios with longer conversations and denser utterances, since this would lead to extremely sparse rewards that are only available at the end of the dialogue. This makes directly employing RL training infeasible. Hence, in the next Chapter, we explore a different approach. Piggybacking on the recent advances in NLP, we investigate how Large Language Models can support research in mixed-motive interactions. Later, in Chapter 9, we will discuss ways in which one can combine LLMs with carefully designed RL policies.

Chapter 7
Evaluating Large Language Models in Mixed-Motive Contexts

[Figure 7.1 depicts a camping-items negotiation between P1 (self) and P2 (partner), annotated with the capabilities required at the Start, During, and End stages: C1 (comprehending one's own context, the negotiation context, and the final deal outcome), C2 (annotating dialogue acts such as Self-need, Inquiry, and Proposal), C3 (inferring the partner's preferences and satisfaction), and C4 (generating the next response). P2's item values are not visible to P1.]

Figure 7.1: Overview of the key capabilities (C1-C4) required for a successful negotiation. We design tasks aligned with these abilities to assess how LLMs can advance different aspects of negotiation research. The negotiation scenario is based on Chawla et al. (2021b).

Taking an example from the CaSiNo dataset, Figure 7.1 summarizes the four core capabilities that are essential for success in a typical bilateral multi-issue negotiation.
To achieve success, a 110 negotiation agent must have a deep comprehension of its initial context as well as an understanding of how the negotiation concluded (C1). During the negotiation, it must also understand the structure (e.g., intents and slots) and the semantics (e.g., employed linguistic strategies) of the utterances (C2). Notice that the partner’s context is not visible to the agent at the start of the conversation – it is only through the dialogue that the agent must infer its partner’s preferences so as to come to an agreement with them (C3). Finally, the agent must be able to generate a coherent and strategically appropriate response conditioned on the input context and conversation history (C4). These complexities make negotiation a challenging testbed for AI research. Large Language Models (LLMs) have demonstrated remarkable performance in various text comprehension and reasoning tasks (Brown et al., 2020; Kojima et al., 2022). They have also shown exceptional capabilities as dialogue systems, accurately understanding dynamically changing conversation contexts and generating human-like responses (OpenAI, 2022; Chiang et al., 2023). Consequently, it is only natural to explore how LLMs can be useful in negotiations as well – for designing end-to-end (Lewis et al., 2017) and modular negotiation dialogue systems (He et al., 2018), for providing pedagogical feedback through ToM (Johnson et al., 2019a), and for scaling up data collection practices by automating the collection of utterance-level annotations (Yamaguchi et al., 2021). There exists prior work that explores LLMs for negotiation tasks (Fu et al., 2023; Gandhi et al., 2023; Abdelnabi et al., 2023). Fu et al. (2023) designed an LLM-based agent for balloon price negotiations via self-play and feedback from a critic-LLM. Gandhi et al. (2023) used LLMs to improve strategic reasoning based on few-shot CoT prompting and demonstrations about states, values, and beliefs. Abdelnabi et al. (2023) focused on interactive multi-agent games, showing that agents can consistently reach successful deals through systematic zero-shot CoT prompting. 111 While these studies employed LLMs for designing end-to-end agents, a systematic exploration of their diverse abilities is missing, inhibiting their use in negotiation-related use cases. To this end, we design a novel methodology to conduct a systematic analysis of state-of-theart LLMs in negotiations based on 35 tasks developed on 4 dialogue datasets (Section 7.1). Our methodology leverages 3 crucial axes to define these tasks in a principled manner: 1) Objectivity (Objective vs. Subjective tasks), 2) Time Stage of the negotiation (Start, During, and End), and 3) Task Type (Comprehension, Annotation, Partner Modeling, and Generation), aligned with the core competencies necessary for successful negotiations (Figure 7.1). Although our framework is designed with the goal of supporting negotiation research, the resulting methodology captures the nuances common in other dialogue tasks as well. Our overall trends show the superiority of GPT-4, finding that it often outperforms even the Flan-T5 baseline which is specifically fine-tuned for the corresponding task (Section 7.3). Additionally, all models show poor correlations with human judgments on subjective assessments about the negotiation. Through a human evaluation, we find that GPT-4 performs comparable to an average crowdsourcing worker in response generation. 
We perform an error analysis, discussing the coherency and strategic failures that still persist in the compared LLMs (Section 7.3.4). 7.1 Methodology We evaluate the multifaceted capabilities of LLMs in negotiations with a three-step framework (Figure 7.2): 1) Designing tasks from human-human dialogue datasets, 2) Constructing prompt templates, and 3) Performing evaluations and analysis across various proprietary and open-source LLMs. 112 Figure 7.2: Our methodology for systematically evaluating LLMs in negotiation dialogues. Part A (top) describes the pipeline for creating task-specific prompts from a negotiation dataset and evaluating various LLMs with them. Part B (bottom) depicts the tasks categorized by Objectivity, Time Stage, and Task Type (Section 7.1.2). 7.1.1 Datasets Three key requirements guide our dataset selections: 1) Well-structured design with concrete player preferences and outcomes, 2) Diversity in the task scenario, and 3) Availability of metadata for testing holistic LLM abilities. Given this criteria, we selected 4 datasets: CRA (DeVault et al., 2015), DND (Lewis et al., 2017), CA (Chawla et al., 2021b), and JI (Yamaguchi et al., 2021). Table 7.1 provides an overview of these datasets. While each dataset captures a distinct scenario, they are all based on a tractable closed-domain abstraction from the literature, referred to as Multi-Issue Bargaining Task or MIBT (Fershtman, 1990). MIBT involves bilateral negotiations over fixed 113 Dataset Scenario Issues Size CRA Artifacts trading (Painting, Lamp, Album) 119 DND General items (Ball, Hat, Book) 6,251 CA Campsite Neighbors (Food, Water, Firewood) 1,030 JI Job Recruiter-Worker (Salary, Day-off, Position, Company, Workplace) 2,639 Table 7.1: The datasets used in our analysis. We sample 200 test instances for our evaluation. The training data is used for fine-tuning our baseline model. quantities of some issues (Table 7.1). Each player has predefined preferences towards each issue (called values), which define the player’s priority order. The task for a player is to engage in a dialogue with their partner, whose priorities are not known beforehand, and divide all the available items so as to maximize their total points scored – computed based on the assigned preferences and the final agreed deal. MIBT is a generic framework that can capture many interesting real-world negotiations and, hence, has been adopted in NLP, beyond NLP (Mell and Gratch, 2017), and in the industry as well (e.g., iDecisionGames1 ). Some datasets like CA also report subjective metrics like satisfaction from the outcome and the player’s likeness towards their partner. Some include utterance-level dialogue act (e.g., CRA) and strategy (e.g., CA) annotations. We use this metadata to design a variety of tasks to evaluate LLMs. 7.1.2 Task Design We build a total of 35 tasks based on the datasets described above. Our guiding principle is the following: Every task is designed from the perspective of an agent that is negotiating itself. This is crucial since it governs the kind of information that we can use in the prompts for LLMs (e.g., explicit partner preferences are not visible). 1https://idecisiongames.com/promo-home 114 To describe our tasks in a principled way, we categorize them based on: 1) Objectivity, 2) Time stage, and 3) Task Type (Figure 7.2). We discuss these criteria below. 
Although these are designed with the domain of negotiations in mind, many nuances discussed below directly apply to traditional task-oriented or open-domain dialogues (Wang et al., 2023) as well, making our evaluation framework broadly applicable. 7.1.2.1 Objectivity Most tasks are objective, meaning that they test whether the models can accurately answer questions on the objective facts about a negotiation. These tasks can be evaluated with standardized metrics from the literature. In contrast, a few tasks are categorized as Subjective, where there is no one correct answer. This includes predicting outcome satisfaction and partner likeness as reported by the human players in a post-survey. These subjective measures have been argued to be crucial for successful task outcomes, especially in scenarios involving repeated interactions between the same stakeholders (Oliver et al., 1994; Mell et al., 2019; Chawla et al., 2021a). Since they govern the user’s perception of the agents in human-agent interactions, the models must be capable of capturing these effects. Hence, we also design tasks to judge (by computing correlations) whether the model’s predictions align with those reported by the human players. We consider response generation as another subjective task, which is evaluated via automatic and human evaluation. 7.1.2.2 Time Stage A negotiation dialogue typically unfolds in 3 stages: Start, During, and End. We make this seemingly trivial distinction since it directly impacts the information used while designing the prompts: in 115 a given stage, we only input the information that an agent would have access to (from its own perspective) at that point in the negotiation. At the start (where no actual dialogue has occurred yet), the agent is only aware of the basic information from the negotiation context, including the scenario description as well as the item counts and priority values for each issue (Figure 7.2). During the dialogue, the LLMs also get access to the partial historical conversation. Finally, tasks defined at the end of the negotiation use the entire dialogue within their prompts. 7.1.2.3 Task Type We categorize the tasks into four types based on the kind of capabilities they are designed to test. Comprehension: In the Start stage, these tasks assess the model’s ability to understand the negotiation context, which is fundamental for any sensible negotiation agent. This includes asking the total number of available items, the priorities for the issues, and the maximum points the agent can earn from the negotiation. In the End stage, Comprehension tasks evaluate whether the models can answer questions about the final outcomes, including details of the final deal and the subjective measures of satisfaction and partner likeness. Proficiency in these tasks can aid in automatically evaluating human-human or human-agent negotiations. Annotation: These involve detecting dialogue acts (e.g., disagree, propose) or linguistic strategies (e.g., elicit-preference, self-need) in utterances, along with parsing incoming offers. Annotation tasks are critical components of the Natural Language Understanding (NLU) module for a typical modular dialogue system. During the negotiation, they are essential in understanding the input from the partner to decide the agent’s future actions. 116 Partner Modeling: These tasks are closely linked to Theory of Mind (ToM) (Frith and Frith, 2005), where the goal is to infer the mental states of the partner. 
In our work, we focus on predicting the partner’s priorities and subjective assessments. Partner Modeling is particularly vital during the negotiation, where understanding the other party’s requirements and adapting one’s strategy accordingly can be key for efficient and favorable outcomes. Generation: Finally, we consider the task of response generation based on context and dialogue history. This includes crafting offers and counteroffers, responding to incoming offers, and other communicative elements that advance the negotiation toward a successful conclusion. This is an essential requirement for a conversational agent and tests whether the models can reason through the current state and respond in a way that is both contextually appropriate and strategically advantageous. 7.1.3 Prompting for Evaluation To assess LLMs’ capabilities on our proposed tasks, we develop prompt templates that are later filled with the relevant information from the dataset instances. The prompts for zero-shot evaluation include (in order) the task description, issue counts & values, the dialogue or an utterance, and the question. For our analysis in Section 7.4, we also employ alternative prompting strategies, such as few-shot prompting and Chain-of-Thought (CoT) (Wei et al., 2022b). We provide task-specific details in Appendix E.1, including the questions used in the prompts and the dataset applicable for each task. We also provide example prompts in Appendix E.2. 117 7.2 Experimental Design Our primary goal is to analyze the effectiveness of state-of-the-art LLMs for strategic negotiation interactions. Here, we discuss the compared models and evaluation metrics for the results presented in Section 7.3 (overall zero-shot results) and Section 7.4 (task-specific analysis to gain further insights). 7.2.1 Baselines We use the Majority-label model and a fine-tuned Flan-T5-base (Chung et al., 2022) as baselines. We selected Flan-T5 for its superior performance across a wide range of NLP tasks and flexibility to handle diverse input and output formats. For fine-tuning, we use the AdamW optimizer (Loshchilov and Hutter, 2019) with an initial learning rate of 5e-5 and a linear scheduler. We train the models for a maximum of 5 epochs with a batch size of 8, picking the best-performing checkpoint for evaluation. We provide detailed information about this training process in Appendix E.3, including data processing steps and the computational resources used. 7.2.2 LLMs for Evaluation We compare a variety of LLMs, top-performing on popular leaderboards at the time of experimentation. Among proprietary models, we choose OpenAI’s GPT-3.5-Turbo (OpenAI, 2022) and GPT-4 (OpenAI, 2023). Among open-source LLMs, we experiment with Mistral-7B (Jiang et al., 2023) and Wizard-13B (Xu et al., 2023), along with Vicuna 13B and 33B variants (Chiang et al., 2023). 
118 GPT-3.5 GPT-4Wizard13b Vicuna13b Vicuna33bMistral7b Models 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy 0.767 0.817 0.518 0.541 0.547 0.727 (a) Comprehension (Start) Majority Flan-T5GPT-3.5GPT-4 Wizard13b Vicuna13b Vicuna33b Mistral7b Models 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy 0.302 0.601 0.468 0.653 0.384 0.379 0.347 0.364 (b) Comprehension (End) Majority Flan-T5GPT-3.5GPT-4 Wizard13b Vicuna13b Vicuna33b Mistral7b Models 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy 0.5 0.467 0.373 0.417 0.445 0.266 0.216 0.092 0.0 0.2 0.4 0.6 0.8 1.0 PCC 0 -0.01 0.21 0.3 0.12 0 0.11 0.11 (c) Comprehension (Subjective) Majority Flan-T5GPT-3.5GPT-4 Wizard13b Vicuna13b Vicuna33b Mistral7b Models 0.0 0.2 0.4 0.6 0.8 1.0 F1 0.06 0.51 0.525 0.624 0.323 0.39 0.43 0.334 (d) Annotation Majority Flan-T5GPT-3.5GPT-4 Wizard13b Vicuna13b Vicuna33b Mistral7b Models 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy 0.309 0.717 0.609 0.742 0.363 0.363 0.355 0.458 (e) Partner Modeling Majority Flan-T5GPT-3.5GPT-4 Wizard13b Vicuna13b Vicuna33b Mistral7b Models 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy 0.475 0.505 0.368 0.413 0.314 0.219 0.131 0.132 0.0 0.2 0.4 0.6 0.8 1.0 PCC 0 0.09 0.27 0.39 0.02 0.17 0.2 0.18 (f) Partner Modeling (Subjective) Figure 7.3: Overall results for zero-shot evaluation of LLMs. F1: macro F1 over all labels, PCC: Pearson Correlation Coefficient. Each bar shows the average result across all suitable tasks in the category. For example, as per (b), GPT-4 gets 65.3% Accuracy on average for Comprehension tasks in End time stage. Selected tasks for building these plots and full results are in Appendix E.1. 7.2.3 Evaluation Metrics The metrics depend on the kind of task being evaluated. For annotation tasks, we use MacroF1 to account for label imbalance. For subjective tasks that measure satisfaction and likeness scores, we use correlations with human judgments. For response generation, we report both automatic evaluation (BLEU and ROUGE) and human evaluation (response coherency and strategic appropriateness). For all others, primarily comprehension and partner modeling tasks, we rely on Accuracy to measure the performance. The metric used for each task can also be found in Table E.4 of Appendix E.1. 119 7.3 Results Figure 7.3 and Table 7.2 present the key overall trends for zero-shot evaluation. We provide the complete task-wise results in Appendix E.1 and discuss the key findings below. 7.3.1 Comprehension Tasks As evident in Figure 7.3a, GPT-4 outperforms all other models on Comprehension tasks in the Start stage with an average accuracy of 81%. This was followed by GPT-3.5 and Mistral7B, both still performing above 70%. However, note that these tasks primarily involve questions about understanding the negotiation context given explicitly in the prompt. Hence, a simple rule-based parser would achieve 100% performance for these tasks – indicating a huge room for improvement. The tasks in the End stage are more challenging, requiring the models to comprehend the dialogue, perform reasoning to extract the agreed deal, and then compute the answer. As per Figure 7.3b, most models fail to perform well, only marginally better than the trivial Majority baseline. GPT-4 still exhibits superior performance, even surpassing Flan-T5. In comparison, since all evaluated instances end in an unambiguous final deal, an expert human can achieve nearly a perfect score on these tasks. For subjective tasks, we analyze if the LLMs’ self-assessments of outcome satisfaction and likeness for the partner align with those reported by the human players. 
The results from Figure 7.3c show that this is not the case, with the models performing poorly on Accuracy and PCC. This leaves uncertainties about the LLMs’ abilities to capture the psychological states of the human players in negotiations. Future work on understanding the psychological personalities exhibited by LLMs can help to explain these observations. 120 7.3.2 Annotation Tasks Figure 7.3d reveals that both GPT-3.5 and GPT-4 beat the fine-tuned Flan-T5 baseline on annotation tasks, achieving 62.4% and 52.5% Macro-F1 respectively. Task-wise results (Table E.4 of Appendix E.1) show that these trends are driven by the poor performance of Flan-T5 on dur_dial_act_ji task, but on most tasks, Flan-T5 performs better than GPT-4, indicating room for improvement in zero-shot scenarios. Further, models find it more challenging to detect negotiation strategies than dialogue acts, probably because the strategies are often expressed in a more subtle manner. While these results are without few-shot examples or prior utterances – we will explore these possibilities later in Section 7.4. 7.3.3 Partner Modeling Tasks Figure 7.3e shows the results for the objective partner modeling tasks (inferring partner priorities) when the entire dialogue is included in the prompt. GPT-4 achieves the best performance, indicating strong out-of-the-box Theory of Mind (ToM) abilities, as opposed to smaller opensource variants that mostly perform similarly to the Majority baseline. Later in Section 7.4, we analyze how the model performance varies by varying the number of utterances seen by the model. Meanwhile, for inferring the partner’s subjective assessments (Figure 7.3f), Flan-T5 achieves the best accuracy of 50.5%, while GPT-4 gets the highest PCC score of 0.39 – indicating that all models perform poorly when it comes to assessing the subjective perception of their partners. Prior work suggests that, whenever available, including the partner’s demographics, personality, and emotion expression in the dialogue can improve these predictions (Chawla et al., 2023a). 121 Model DND CA BLEU↑ Rouge-L↑ Coherency↑ Strategy↑ BLEU↑ Rouge-L↑ Coherency↑ Strategy↑ Human 4.32 4.26 4.44 3.56 Flan-T5 .167 .453 4.16 4.14 .028 .165 3.50∗ 2.94∗ Mistral7b .010 .130 3.86∗ 3.24∗ .010 .130 3.84∗ 3.16 Wizard13b .032 .190 3.12∗ 3.28∗ .017 .135 3.60∗ 3.18 Vicuna13b .022 .172 3.56∗ 3.36∗ .015 .135 3.98 3.34 Vicuna33b .038 .216 3.92 3.84 .016 .147 4.52 3.44 GPT-3.5 .030 .200 3.72∗ 3.60∗ .025 .162 4.20 3.34 GPT-4 .017 .178 4.48 4.00 .011 .149 4.64 3.52 Table 7.2: Results on response generation. BLEU and Rouge-L are computed on the full testset (200 instances). Coherency and Strategy are based on a human evaluation of a random subset of 50 examples on a scale of 1-5. ∗ means significantly worse performance than the Human reference according to a Paired Bootstrap test. 7.3.4 Generation Tasks We report the results in Table 7.2. Flan-T5 beats other models on BLEU and ROUGE by a huge margin, probably due to dataset-specific fine-tuning. LLMs fail to follow the utterance style and structure in the dataset, even though asked to do so in the prompts, resulting in poor overlap with the reference (examples in Appendix E.5). This can be attributed to the instruction-tuning step for training the LLMs. A poor overlap with the reference does not directly imply that the response is invalid. Hence, we additionally perform an expert human evaluation (by an author of this work) on a subset of 50 examples. 
We measure two dimensions on a scale of 1-5: Coherency (How appropriate is the response given the dialogue history?) and Strategy (How strategic is the response given the agent context and the goal of maximizing the performance?). 122 GPT-4 scores the highest on Coherency while the Human reference baseline performs the best on Strategy for both datasets. However, the differences between these methods are insignificant, indicating that GPT-4 performs similarly to the average crowdsourcing worker. Further, the models perform better on Coherency than Strategy – generating a fluent and contextually appropriate response is easier than generating a strategic response aligned with the task goals. This can be attributed to the standard autoregressive training followed for these LLMs. While the models learn to output a plausible next utterance, they fail to do so strategically – indicating task-specific tuning might still be necessary for improving performance in strategic social influence settings. Finally, the models perform much worse on Strategy for CA than DND, with the former involving a more complex negotiation scanario and richer language use. GPT-4 performing similarly to the crowdsourcing workers is, although remarkable, insufficient for downstream applications in conversational AI and pedagogy. Prior work in negotiation agents (Lewis et al., 2017; He et al., 2018; Chawla et al., 2023c) also relies on a combination of supervised learning (learning the average behavior) and RL (explicit training for a task-specific reward). Hence, we perform an error analysis and discuss the key limitations below to guide future work. Sophisticated Incoherent Generations: Handling off-topic interaction remains challenging, and trivial issues like repetition still persist. In some cases, we also found sophisticated coherency errors. Given the dialogue history from DND, “...YOU: i cant accept that . how about i get the book and 1 hat , and you can get the balls and 1 hat ? THEM: i can’t accept that deal... can i have one of each ?”, Wizard13b replies: “YOU: I understand you want one of each. How about I get the book and one hat, and you get the balls and one hat? That way, we both get something we want.”. Even though the output seems novel and valid, a closer look reveals that it is simply a paraphrase of YOU’s 123 previous offer already rejected by the partner (THEM). We conclude ensuring coherency remains an open challenge and cannot be assumed to hold for LLMs. Failure to Incorporate Strategic Cues: Generating strategically sound responses is difficult for all models. The models often fail to cater to their own priorities and struggle to utilize the information shared by the user. In one case from CA, even though the partner says, “I am good at making fire and have a starter so I can give up all the wood.”, the models fail to use this information in their counter offers. Overly Agreeable Behavior: In general, the models tend to show agreement even with unfair offers, indicating that they are open for exploitation. We tried varying the prompts to adjust the model behavior, but this proved to be challenging. Future work must focus on improving robustness and reasoning abilities of LLMs. 7.4 Discussion We now analyze the impact of popular prompting strategies, focusing primarily on GPT-4, given its superior overall performance. 7.4.1 CoT Prompting We focus on tasks that require multi-hop arithmetic reasoning. 
These involve computing the maximum points possible out of a negotiation and inferring the actual total points scored in the given dialogue. We observe extremely high gains using CoT with GPT-4, achieving nearly 100% accuracy on 3 out of 4 tasks (Figure E.2 of Appendix E.4). This provides further evidence that CoT 124 prompting is useful for LLMs (Madaan et al., 2023), highlighting its effectiveness in negotiation tasks as well. We provide a sample CoT prompt in Table E.8 of Appendix E.2. 7.4.2 Using Prior Utterances for Annotation Tasks We include two prior utterances in the prompts for annotation tasks. The results are mixed (Figure E.1 in Appendix E.4) – the performance improves in only one task while showing minor degradation in the remaining three. We expect the type of annotation labels and the data collection methodology to play a role here, but the results also indicate that irrelevant utterances might even confuse the models. Ultimately, including prior utterances must be a domain-specific choice based on the observed validation performance. 7.4.3 Few-shot Prompting One of the most remarkable abilities of LLMs is In-Context Learning or ICL (Wei et al., 2022a), which enables LLMs to grasp the underlying task better and follow instructions more effectively. We investigate whether adding two randomly sampled examples in annotation tasks could enhance LLMs’ performance via ICL. We find that ICL improves performance on two out of four tasks, showing evidence that few-shot examples can improve the model performance (Figure E.1 in Appendix E.4). Selecting the best examples for ICL is an active area of research, and we expect these advancements to help tasks designed for strategic interactions as well. 125 7.4.4 Varying the Number of Seen Utterances We ask two research questions: 1) Do additional utterances confuse the model about its own context, which is explicitly provided in the prompt? and 2) Does the model successfully utilize the additional utterances to infer the partner’s context? We observe that the answer to both of these questions is ‘Yes’ (Figure E.3 in Appendix E.4). In general, when asked about its own preferences, the models’ performance degrades when more and more utterances are supplied. However, when asked about the partner preferences, the models’ performance increases as more utterances are seen. One way to explain this is through recency bias – the models focus more on the recently provided information, which improves partner modeling but hurts the models’ comprehension of their own context given at the start of the input. 7.5 Conclusion We devised a methodology to systematically analyze the multifaceted capabilities of LLMs in negotiations. When evaluated out-of-the-box, GPT-4 outperforms other LLMs while still leaving room for improvement for most tasks. However, CoT and few-shot prompting can somewhat fill this gap for several arithmetic reasoning and annotation tasks. In contrast, the smaller opensource models struggle to do well, performing comparably to the trivial Majority baseline for key Comprehension and Partner Modeling tasks. Based on our results, we conclude that LLMs can indeed be helpful across a number of use cases in negotiation research. This is not only limited to designing dialogue systems but also includes scaling up pedagogical and data collection practices. LLMs capable of partner modeling can help to provide feedback to students who fail to elicit and incorporate their partner’s preferences (Johnson 126 et al., 2019a). 
Our results on annotation tasks indicate that LLMs can make the annotation process efficient (albeit with a human in the loop), aiding both linguistic strategy analysis and the design of modular dialogue systems. However, we note that this only holds true for proprietary LLMs, stressing the importance of continued efforts to improve smaller open-source models. We also find that extra information in the dialogue can often confuse the models about their own context (provided explicitly in the prompts) – future work must devise ways to enable LLMs to handle longer contexts better, learning to focus only on the relevant parts of the input. Our human evaluation shows that all models struggle with generating strategically appropriate responses. Recent efforts in structured reasoning with LLMs can help to mitigate this issue (Zhou et al., 2024). These approaches are naturally suited for strategic dialogue, where response generation can be broken down into subgoals that are more feasible for the models as per our evaluations in this work, such as sequentially performing comprehension, annotation, and partner modeling before outputting the final response. We discuss other ways in which LLMs can help mixed-motive research in Chapter 9. But before that, we describe our efforts in reshaping the existing dialogue research landscape in the next Chapter, providing an extensive literature review of prior datasets and methods. 127 Chapter 8 Mixed-Motive Tasks Across Diverse Domains: A Literature Review From our discussion in Chapter 2, recall that dialogue research has traditionally been categorized into task-oriented systems, ones focused at assisting users for task completion, and open-domain, which engage in general chit-chat with human partners. Further, we hinted that mixed-motive interactions, such as MIBT-based negotiations, do not trivially belong to any of these two dialogue research categories. This incompatibility is, unfortunately, not only limited to the negotiation scenarios discussed in this dissertation so far. In this Chapter, we zoom out from MIBT interactions and provide numerous other instances of real-world tasks that do not fit this binary categorization, hence, necessitating a deeper investigation. For instance, consider a human user who signs up to interact with a persuasive dialogue system that motivates to engage in physical exercise. The system: 1) uses social cues like small talk and empathy, useful for providing continued support, and 2) employs persuasive strategies to convince the user who, at least in the short-term, is reluctant to indulge in exercise. Although the system is task-oriented or goal-oriented per se, similar to MIBT negotiations, the task here 128 goes beyond the traditional definition of assisting users given the possible misalignment between the goals of the system and the user. Clearly, this system is also not open-domain (Huang et al., 2020). Although the conversations involve open-ended social interactions, there is still a concrete goal of encouraging the user towards a healthier habit. Scenarios with similar mixed-motive circumstances are easily visible in everyday life, including games (Peskov et al., 2020), social platforms (Tan et al., 2016), and therapeutic interactions (Tanana et al., 2016). Despite numerous independent efforts in identifying and analyzing these scenarios, there is a lack of common understanding around them in AI research which inhibits a systematic study in this space. 
Further, data-driven efforts for dialogue systems and other helpful tools in this space are highly limited and scattered. To this end, our work formally introduces a novel category that encompasses these dialogue tasks to promote principled progress in the future (Section 8.1). Developing dialogue systems for these applications holds importance in AI research for multiple reasons. Tackling these tasks not only involves AI but also aspects of game theory, communication, linguistics, and social psychology, making them an ideal testbed for interdisciplinary AI research. Most importantly, they reflect AI’s general ability to consider their partners’ inputs, tailor the communication strategies, personalize the responses, and lead the conversation actively. In addition, we design a taxonomy for existing datasets, studying their task structure (symmetric vs asymmetric) and context (local vs global). We also organize them by their domains: games, multi-issue bargaining, social good, e-commerce, therapy and support, argumentation, conversational recommendations, and miscellaneous tasks (Section 8.2). We further provide a categorization of existing methods, assisting readers to comprehend the progress and reflect on future directions. We organize them based on system strategy, language generation, partner model, architecture, learning process, and the use of pretrained language models (Section 8.3). Over the 129 years, research in task-oriented and open-domain dialogues has benefited from a myriad of survey efforts (Huang et al., 2020; Zhang et al., 2020c; Ni et al., 2023). Instead, we focus here on discussing efforts for mixed-motive contexts and presenting a thorough review across various domains. 8.1 Expanding the Dialogue Research Categorization The discussed incongruity calls for a new category of dialogue systems that does justice to these ubiquitous real-world interactions not captured by existing dialogue categorizations. We label this category social influence dialogue systems which act interactively to achieve desirable behavior or opinion changes in decision-making contexts (Zhang et al., 2020a; Lee et al., 2020) 1 . This calls for an active role by the system, distinguishing them from other well-studied scenarios, such as purely task-oriented, where systems passively assist their partners to complete tasks, and open-domain, which target social companionship. Key tasks that fall into this category include persuasion (Wang et al., 2019), aiming to change users’ undesirable attitudes or behaviors, and negotiation, aiming to change the users’ perspective to achieve a common ground (Lewis et al., 2017). At a conceptual level (Figure 8.1), these interactions contain two kinds of conversational content, in some aspects, falling in between the existing dialogue categories. The task-oriented content focuses on a domain-specific goal, like persuading for donation, bargaining with trade-offs, or encouraging healthier habits. In addition, these interactions may also contain social content, such as small talk, empathy, or self-disclosure. The task-oriented content provides a context for social interactions and depending on the task setup, social content is rather optional. However, if 1Social influence is the change in thoughts, feelings, attitudes or behaviors resulting from any interaction with an individual or a group (Rashotte, 2007; Gass, 2015). The term captures the primary objectives of the systems for mixed-motive tasks discussed in this dissertation. 
130 present, it can in turn build rapport and enhance the user-system relationship for improved task outcomes (Liao et al., 2021). Figure 8.1: A conceptual overview. Connections with task-oriented and open-domain systems: Our proposed category is not meant to be disjoint from these traditional categories. However, it still uniquely brings together the various mixed-motive tasks that capture aspects of social influence, which is fundamentally absent from how we primarily define dialogue tasks in the community. We hope defining a new category would foster dedicated efforts towards this important aspect of real-world conversations. Task-oriented scenarios focus on collaborative information exchange for a common goal of task completion. For the tasks that we focus on here, the goals of the system and the user can be different and even conflicting, leading to collaborative or non-collaborative interactions. Further, the goals can go beyond the current task (e.g., multiple therapy interactions, repeated negotiations, and so on), leading to social interactions for long-term relationships. If a scenario demands social influence skills from the dialogue system, we consider it under our defined category in this work. For instance, He et al. (2018) studied buyer-seller price negotiations. In such a mixed-motive 131 task, the buyer has a concrete goal of negotiating for a reasonable price (arguably making it taskoriented), but achieving it requires social influence skills of engaging in trade-offs and building a rapport with the seller to reach an agreement. Measures of Success: The above discussion indicates that a comprehensive evaluation of such systems must draw from both task-oriented and open-domain dialogue research. Since extensive surveys exist that discuss the evaluation in these settings (Deriu et al., 2021; Li et al., 2021), we do not cover them in detail here. However, we define three essential axes for evaluation: 1) Linguistic Performance, or the system’s linguistic sophistication based on automatic (e.g., perplexity, BLEU) and human (e.g., fluency, consistency, coherency) evaluation, 2) Task Outcome, or the objective taskspecific goals such as the negotiated price or weight loss after therapy, and 3) Partner Perception, or the subjective evaluation by the user, for instance, the user’s satisfaction, likeness towards the system, and interest in interacting again. In a buyer-seller negotiation, if the seller hates the buyer in the end, no matter how favorable the deal is for the buyer, one might argue that this is still a failed negotiation for the buyer. Hence, we encourage future work to take all three dimensions into account collectively. 8.2 Existing Datasets We now illustrate mixed-motive tasks across numerous domains and application areas that capture various aspects of social influence. In total, we curated 22 datasets from prior work, spanning 12 publication venues, 4 languages, and 7 application domains (see Appendix F.1 for details on the compilation process). 
In general, the datasets capture the following information about an interaction: the non-conversational context for the participants (e.g., negotiation preferences or 132 Name (Citation) Domain Source Structure Context # of Parties STAC (As16) Games Crowdsource Symmetric Global Multiparty Diplomacy (Pe20) Games Crowdsource Asymmetric Global Multiparty DinG (Bo21) Games University game night logs Symmetric Global Multiparty Tabletop (De15) MIBT Face-to-face, Wizard-of-Oz Symmetric Local Bilateral DealOrNoDeal (Le17) MIBT Crowdsource Symmetric Local Bilateral CaSiNo (Ch21) MIBT Crowdsource Symmetric Local Bilateral JobInterview (YaD21) MIBT Crowdsource Asymmetric Local Bilateral PersuasionforGood (Wa19) Social Good Crowdsource Asymmetric Global Bilateral CraigslistBargain (He18) E-commerce Crowdsource Asymmetric Local Bilateral AntiScam (Li20) E-commerce Crowdsource Asymmetric Global Bilateral MI (TaC16) Therapy & Support Psychotherapy session logs Asymmetric Global Bilateral SMS Counseling (Al16) Therapy & Support SMS chat logs Asymmetric Global Bilateral EmpatheticDialogues (Ra19) Therapy & Support Crowdsource Asymmetric Global Bilateral Hotline Counseling (De19) Therapy & Support Synthetic Transcripts Asymmetric Global Bilateral mPED (LiE21) Therapy & Support Physical activity clinical trials Asymmetric Global Bilateral Congressional Debates (Th06) Argumentation U.S. Congressional transcripts Asymmetric Local Multiparty Supreme Court (Da12) Argumentation Oyez.org transcripts Asymmetric Local Multiparty DeCour (Fo12) Argumentation Italian court hearings Asymmetric Local Multiparty ChangeMyView (Ta16) Argumentation Reddit Asymmetric Local Multiparty DDO Debates (Du19) Argumentation debate.org logs Symmetric Local Bilateral Court Debates (Ji20) Argumentation China Court transcripts Asymmetric Local Multiparty Target-Guided (Ta19) Miscellaneous Crowdsource Symmetric Local Bilateral Table 8.1: Overview of the available dialogue datasets. This list is non-exhaustive, and also covers the datasets that have enabled research into various sub-tasks and analyses that can eventually be useful for dialogue systems in respective domains. Key statistics and associated metadata are in Appendix F.1. other role-specific information), the conversation between them, and outcome assessment. Optionally, some datasets also gather participant demographics and personality traits, utterance-level annotations, and subjective evaluations via post-surveys. To understand the structural similarities and differences between these datasets, we design a taxonomy with two primary dimensions: Task Structure (Symmetric vs Asymmetric), and Context Definition (Global vs Local). Task Structure captures whether the participant roles are defined in a symmetric or asymmetric manner. For instance, a typical multi-issue negotiation is symmetric, in the sense that both parties have their own preferences and goals based on 133 which they actively try to reach a favorable agreement (Lewis et al., 2017). On the other hand, a counseling session between a therapist and a patient is asymmetric, where the therapist attempts to emotionally support the patient by employing social influence skills (Althoff et al., 2016). Context Definition relates to whether the input context before each interaction is defined globally or locally. For instance, the PersuasionForGood dataset globally defines the context of persuasion for charity donation, which is kept the same throughout (Wang et al., 2019). 
On the contrary, in a typical debate, although the rules are defined globally, the conversation topic and arguments are local and can vary for each conversation (Durmus and Cardie, 2019). We present this categorization in Table 8.1. We further categorize the datasets according to their Domain, Source, and the # of parties. We provide key statistics and the available metadata in Appendix F.2. We now briefly discuss the datasets in each domain. Multi-Issue Bargaining Tasks (MIBT): This covers the numerous mixed-motive negotiation datasets discussed so far in this dissertation. DealOrNoDeal (Lewis et al., 2017) involves negotiations over three issues: books, balls, and hats. Other datasets define a more grounded scenario, such as symmetric CaSiNo (Chawla et al., 2021b) negotiations between two campsite neighbors and asymmetric JobInterview (Yamaguchi et al., 2021) negotiations between recruiters and applicants. Games: Strategy mixed-motive games naturally involve social influence dynamics of trust and deception. Diplomacy captures deception in long-lasting relationships, where players forge and break alliances to dominate Europe (Peskov et al., 2020). Catan revolves around the trade of resources for acquiring roads, settlements, and cities (Asher et al., 2016; Boritchev and Amblard, 2021). The players have access to only a subset of resources that they would need, which encourages strategic influence and trade. 134 Social Good: Social influence is critical for social good applications. The tactics must be personalized using knowledge that is both relevant and appealing. PersuasionForGood (Wang et al., 2019) involves asymmetric interactions led by a persuader who attempts to convince the other participant for charity donation by employing a variety of tactics. For instance, Logical Appeal uses reason and evidence to support the argument, while Emotional Appeal elicits specific emotions. E-commerce: These tasks are typically asymmetric. A buyer influences the seller towards a reasonable price, while the seller tries to maximize their own profit. An effective system must combine price-related reasoning with language realization. CraigslistBargain (He et al., 2018) involves open-ended price negotiations with rich strategies like embellishments, side offers, emotional appeals, and using world knowledge. Another example is customer support interactions in AntiScam dataset (Li et al., 2020), where users defend themselves against attackers who try to steal sensitive personal information with convincing arguments. Therapy & Support: Effective therapy aids in the treatment of mental disorders and substance use disorders, along with changing undesirable behaviors like unhealthy diets. A counselor needs to be adaptive, personalized, should understand the core issues, and should facilitate a change in the patient’s perspective (Althoff et al., 2016). In SMS counseling, Althoff et al. (2016) found that linguistic influence, like pushing the conversation in the desired direction, is associated with perspective change. Similar scenarios were captured in other datasets as well (Demasi et al., 2019; Liang et al., 2021). Tanana et al. (2016) collected the Motivational Interviewing dataset where the goal is to elicit and explore the patient’s own motivations for behavior change. EmpatheticDialogues (Rashkin et al., 2019) captured empathetic support interactions, which have been associated with rapport and better task outcomes (Kim et al., 2004; Norfolk et al., 2007; Fraser et al., 2018). 
135 Argumentation: In addition to factuality and social proof, a convincing argument must also consider the intensity, valence, authoritativeness, and framing (Chaiken, 1987; Althoff et al., 2014). Tan et al. (2016) released the ChangeMyView logs from Reddit, involving discussions on numerous controversial topics. Other datasets include Debate Dot Org (DDO) debates on diverse topics (Durmus and Cardie, 2019), congressional proceedings (Thomas et al., 2006), and court hearings (Fornaciari and Poesio, 2012; D.-N.-M. et al., 2012; Ji et al., 2020). Conversational Recommendation: Everyday scenarios naturally hold potential for influence via recommendations, for instance, a movie fan persuading their friends to watch a movie that they adore. Li et al. (2018) and Dodge et al. (2016) collected movie recommendation datasets. Instead of guiding the conversation toward a specific movie, the goal is simply to provide recommendations based on facts and personal experiences. Nevertheless, they still provide interesting examples of scenarios that fall into this dialogue category. Miscellaneous: The Target-Guided dataset (Tang et al., 2019) was constructed from the PersonaChat corpus (Zhang et al., 2018). Instead of being open-ended, the Target-Guided scenario defines a concrete goal of naturally guiding the conversation to a designated target subject, thereby making it a social influence setting. 8.3 Developed Methods Most domains discussed above have seen efforts in analyzing human dialogue behaviors and their impact on task outcomes. Examples include analyzing deception in games (Peskov et al., 2020), the impact of persuasive strategies and dialogue acts on charity donations (Wang et al., 2019), cooperative and non-cooperative strategies in MIBT (Chawla et al., 2021b), the use of 136 Method Domain Strategy NLG Partner Model Architecture Learning PLM Le17 MIBT Implicit Generation Simulated User Enc-Dec RL Ya18 MIBT Latent Vectors Generation Simulated User Hierarchical RL Zh20 MIBT DAs Generation DA Look-Ahead Modular RL He18 E-Com, MIBT DA Templates + Retrieval Implicit Modular RL Ya21 E-Com DAs Templates + Retrieval DA Look-Ahead Modular RL Zh19 E-Com, Social Good DAs + Semantic Generation Implcit Hierarchical SL Jo20 E-Com DAs + Semantic Generation Implcit Hierarchical SL Li20 E-Com, Social Good DAs Generation Implicit Dec-only SL GPT Li21 Therapy Implicit Generation Implicit Enc-Dec SL Jh21 Therapy Emotion Labels Retrieval, Generation Simulated User Modular RL BERT, GPT Ha21 Argumentation DAs Rule-based Implicit Modular SL Wu21 Social Good Implicit Generation Implicit Dec-only SL GPT2 Ta19 Misc Keywords Retrieval Implicit Modular SL Table 8.2: Categorization of available methods (non-exhaustive). We only cover papers that explicitly design a dialogue system. NLG: Natural Language Generation, PLM: Pretrained Language Model, MIBT: Multi-Issue Bargaining Task, E-Com: E-Commerce, DA: Dialogue Act, Enc: Encoder, Dec: Decoder, SL: Supervised Learning, RL: Reinforcement Learning. Methods that use RL usually apply it in conjunction with SL. emotion expression for predicting partner perceptions (Chawla et al., 2021a) and studying semantic categories of persuasive arguments on web forums (Egawa et al., 2019). In addition, researchers have targeted various domain-specific subtasks that can be crucial for the eventual development of dialogue systems in this space. 
This involves research in lie detection methods (Yu et al., 2015; Yeh and Ku, 2021), discourse parsing (Shi and Huang, 2019; Ouyang et al., 2021), strategy prediction (Wang et al., 2019; Chawla et al., 2021b), breakdown detection (Yamaguchi et al., 2021), outcome prediction (Chawla et al., 2020; Dutt et al., 2020; Sinha and Dasgupta, 2021), and argument mining (Dutta et al., 2022). Research that directly targets the development of dialogue systems in this space is still nascent. Among other challenges like limited cross-cultural diversity and relatively smaller dataset size, social influence dialogue settings pose a unique challenge: an average human often exhibits sub-optimal strategic behaviors in these tasks (Wunderle, 2007; Babcock and Laschever, 2009). 137 This means that standard sequence-to-sequence approaches trained on these collected datasets using supervised learning are fundamentally insufficient for developing effective dialogue systems. Hence, prior work has put special attention to the system strategy, employing different ways to model the strategy and language together. We design a taxonomy of methods developed, assisting readers to comprehend the progress and reflect on future directions. We organize them based on the system strategy, language generation, partner model, architecture, learning process, and the use of pretrained language models. We present annotations for all the surveyed methods in Table 8.2 and discuss the common categories in brief below. 8.3.1 Strategy Representation Implicit: The most obvious way to represent the system strategy is implicitly, without any intended decoupling between system strategy and response realization. This corresponds to the usual sequence-to-sequence framework that has been a standard baseline for the methods developed in this space. An important example is the work by Lewis et al. (2017), one of the early efforts in training end-to-end dialogue models for mixed-motive tasks. The authors employed a neural network based on GRUs, one for encoding the negotiation context, one to encode the dialogue utterances, and two recurrent units to generate the output agreement in a bidirectional manner. Latent vectors: Yarats and Lewis (2018) explored latent vectors to decouple utterance semantics from its linguistic aspects. Their hierarchical approach first constructs a latent vector from the input message, which is then used for response generation and planning. These latent vectors are 138 trained to maximize the likelihood of future dialogue messages and actions, which enables the decoupling between semantics and realization. Dialogue Acts (DAs): Dialogue Acts, such as greeting, offer propose, agreement, or disagreement, are effective at capturing a high-level structure of the dialogue flow, reducing the model strategy to first predicting the dialogue act for the next response. The use of DAs makes it convenient to apply reinforcement learning approaches (Zhang et al., 2020b; Yang et al., 2021) while also aiding in developing a modular dialogue system design (He et al., 2018). Semantic Strategies: The structural properties expressed by DAs are insufficient for capturing semantics like emotion, small talk, and appeal. To better incorporate them, researchers have relied on additional utterance-level annotations grounded in prior theories (Wang et al., 2019; Chawla et al., 2021b). These strategies have been used in conjunction with DAs (Zhou et al., 2019b; Joshi et al., 2020). 
8.3.2 Language Generation An important aspect of the system design is an effective way to realize the language, that is, to generate the next response so that it portrays the desired strategic behaviors. Borrowing from task-oriented and open-domain research, dialogue models in this space use a variety of methods to generate the final system response. Templates and retrieval methods: Predefined templates and response retrieval from the training data simplify the generation pipeline, improving controllability and modularity. He et al. (2018) used templates in their generator, which are later filled by retrieving similar responses from the 139 data. This allowed the authors to explore supervised and reinforcement learning at the level of DAs for the influence strategy of the system. Conditional Generation: Text generation methods result in more diverse responses but negatively impact controllability and interpretability. Prior work relies on autoregressive text generation conditioned on the dialogue history, non-conversational context, and additional annotations. These are either encoder-decoder networks (Lewis et al., 2017; Li et al., 2020; Joshi et al., 2020) or use a decoder-only design (Li et al., 2020). A useful future direction is to combine generation with retrieval for knowledge-grounded settings like argumentation. Similar methods have been explored for other NLP tasks like open-domain question answering and question generation (Lewis et al., 2020). 8.3.3 Partner Modeling Partner modeling refers to inferring the mental states of the partner based on the conversation. For example, understanding the cause that the persuadee cares about in the PersuasionForGood context or inferring the priorities of the partner in DealOrNoDeal negotiations. Building an accurate partner model is essential for guiding the decision-making of the system (Baarslag et al., 2013; Zhang et al., 2020b). Hence, we discuss various ways in which prior work tackles partner modeling. Implicit: A majority of the efforts do not explicitly model the behavior of the partner, but instead, this behavior implicitly guides the next response of the sequence-to-sequence dialogue system pipeline. 140 Simulated User: Lewis et al. (2017) trained a simulated user on the available data in a supervised manner. This was then used to further train the dialogue system. Instead of inferring mental states explicitly, this takes a more behavioral approach of estimating the future actions of the partner and using these for training via reinforcement learning. Dialogue Act Look-Ahead: With a similar idea, Zhang et al. (2020b) proposed OPPA model with a look-ahead based partner modeling strategy at the level of DAs. At each step, OPPA first estimates the user’s future DA, which is then used to select the next DA of the system. The authors found significant improvements on the DealOrNoDeal task. Yang et al. (2021) used a similar method for buyer-seller negotiations. Taking a different approach, Chawla et al. (2022) trained a ranking model to directly predict the hidden preferences of the partner in a multi-issue negotiation. Instead of predicting future actions, these methods assume that the partner’s behavior can be explained by their context and goals in the dialogue. However, this approach has yet to be used in an end-to-end system. 8.3.4 Training Architecture Choices: One crucial aspect is the architecture design: End-to-end (Lewis et al., 2017; Radford et al., 2019) vs Modular (He et al., 2018). 
While end-to-end methods improve diversity and need less manual effort, a modularized design enhances controllability and explainability. Perhaps, this is why modular methods are popular in large-scale models (Hadfi et al., 2021). Improving the control of desired variables such as topics, strategy, or emotion in the end-to-end methods is an open area of research and is yet to be explored for the strategic settings discussed in this work. 141 Supervised Learning (SL) and Reinforcement Learning (RL): Zhou et al. (2019b) used SL to train a hierarchical encoder-decoder for generating the next response and used Finite State Transducers (FSTs) to encode the historic sequence of DAs and persuasive strategies into the model, showing improvements in negotiation and persuasion tasks. The performance was later improved by Joshi et al. (2020), who replaced FSTs with Graph Neural Networks to better model the interdependencies. Others have relied on RL to explicitly optimize the model on taskspecific objective outcomes. While SL trains the model to mimic the average human behavior, RL techniques, such as those based on REINFORCE (Williams, 1992), allow the system to explore its own strategies in the wild while being guided by one or more overall reward metrics. Lewis et al. (2017) used RL in negotiations, with the final points scored in the agreed deal as the reward. More recent work employed RL to incorporate simplistic partner models into the decision-making process of the dialogue system, showing improvements in negotiation tasks (Zhang et al., 2020b; Yang et al., 2021). Multi-tasking and Pretraining: Limited efforts have also explored multi-tasking and pretrained language models, which provide promising ways to deal with the challenge of insufficient training data. Liu (2021) trained a sequence-to-sequence transformer on a mix of Cornell Movie Dialogue corpus (Danescu-Niculescu-Mizil and Lee, 2011) and psychotherapy data. Li et al. (2020) fine-tuned the GPT model (Radford et al., 2018) while employing multi-tasking to incorporate intents and slots for both the human and the system. Wu et al. (2021) recently introduced ARDM which uses GPT2 (Radford et al., 2019) to separately encode the utterances of the human and the dialogue system, reducing the reliance on additional annotations. 142 8.4 Conclusion Going beyond MIBT-based negotiation tasks, we illustrated a wide variety of realistic mixedmotive scenarios that contain aspects of social influence and do not strictly fall into the existing categorizations of task-oriented and open-domain systems. To capture these settings more systematically, we proposed a novel category of dialogue tasks and presented an extensive survey of the existing datasets and methods. In the next Chapter, we discuss the limitations of prior efforts and the prevailing challenges, considering both the mixed-motive tasks based on MIBT as well as the broader landscape laid out in this Chapter. Alongside, we provide our recommendations for future work in this area. 143 Chapter 9 Future Directions Proficiency in managing mixed-motive situations is integral to our everyday communication. This opens up numerous exciting opportunities for automated tools that help to make our personal and professional interactions more effective as well as efficient. On the one hand, research in this area can help directly – by designing dialogue systems (e.g., conversational assistants) or other tools for practitioners such as investors, therapists, lawyers, and educators. 
Multi-tasking and Pretraining: Limited efforts have also explored multi-tasking and pretrained language models, which provide promising ways to deal with the challenge of insufficient training data. Liu (2021) trained a sequence-to-sequence transformer on a mix of the Cornell Movie Dialogue corpus (Danescu-Niculescu-Mizil and Lee, 2011) and psychotherapy data. Li et al. (2020) fine-tuned the GPT model (Radford et al., 2018) while employing multi-tasking to incorporate intents and slots for both the human and the system. Wu et al. (2021) recently introduced ARDM, which uses GPT-2 (Radford et al., 2019) to separately encode the utterances of the human and the dialogue system, reducing the reliance on additional annotations.

8.4 Conclusion

Going beyond MIBT-based negotiation tasks, we illustrated a wide variety of realistic mixed-motive scenarios that contain aspects of social influence and do not strictly fall into the existing categorizations of task-oriented and open-domain systems. To capture these settings more systematically, we proposed a novel category of dialogue tasks and presented an extensive survey of the existing datasets and methods. In the next Chapter, we discuss the limitations of prior efforts and the prevailing challenges, considering both the mixed-motive tasks based on MIBT as well as the broader landscape laid out in this Chapter. Alongside, we provide our recommendations for future work in this area.

Chapter 9
Future Directions

Proficiency in managing mixed-motive situations is integral to our everyday communication. This opens up numerous exciting opportunities for automated tools that help to make our personal and professional interactions more effective as well as efficient. On the one hand, research in this area can help directly, by designing dialogue systems (e.g., conversational assistants) or other tools for practitioners such as investors, therapists, lawyers, and educators. On the other, more implicitly, these innovations can improve our understanding of human cognition and scale up pedagogical practices that better prepare students for handling realistic mixed-motive scenarios.

As evident from the discussion in the previous Chapter, the past few years have seen an increasing interest in enabling mixed-motive human-machine dialogue. Yet, building sophisticated and practically useful systems remains a challenging endeavor. We now lay out several limitations of existing work and key challenges that must be addressed moving forward. We provide our recommendations for future work, focusing specifically on the work presented in this dissertation and, broadly, on the current trends in the literature.

Task Design: The MIBT framework (as described in Chapter 2) forms the backbone for the majority of the work presented in this dissertation. As discussed earlier, MIBT has been popular in both academia and industry, and we found it to be extremely helpful in concretely defining the setup for our negotiation tasks. However, there are many interesting extensions to the task design that we used, which make the scenario more complicated but useful in specific, realistic settings. For instance, although our work primarily looked at bilateral negotiations, realistic interactions often involve more than two participants, such as planning out a family vacation or playing the game of Diplomacy or Settlers of Catan.

In addition, recall that our work assumes that the player preferences are predefined and remain fixed throughout the interaction. This is enforced in the MIBT setup that we consider, since the final objective performance is computed assuming these predefined preferences. However, this assumption may not always hold true: our preferences towards specific items may change after acquiring new information during the interaction. Taking an example from CaSiNo, a player might start out preferring water packages for their camping trip, but this preference can change after learning that there is a stream close to the campsite. Future work should incorporate such situations.

Our setup makes another simplifying assumption that the individual items are indivisible. More specifically, each item (e.g., a food package in CaSiNo or a book in the DealOrNoDeal task) either goes to a specific player entirely, or it does not. This, again, is not necessarily true in practical scenarios. In complex settings, it may be possible to break down an individual item into sub-parts that are claimed by different players in the final agreements. Consider the following scenario conceived by Professor Jonathan Gratch at USC: two sisters are negotiating over who gets an orange. As per our setup, the orange can only go to one of the two sisters. However, during the interaction, the sisters realize that one of them only wants the pulp to make orange juice while the other only needs the husk for her art project. In such a case, there exists a way to divide the orange that maximizes the value for both sisters. In fact, finding creative solutions that work for all the stakeholders is fundamental to mixed-motive interactions. Clearly, such a scenario is not captured by the formulations discussed in this work and opens up exciting avenues for future research.
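To see concretely why divisibility matters, the short sketch below compares the best joint value achievable with an indivisible orange against a divisible one, under assumed valuations in which each sister values only the part she needs at 10 points.

    # Toy calculation for the orange scenario under assumed valuations: sister A only
    # values the pulp (10 points) and sister B only values the husk (10 points).
    values = {
        "A": {"pulp": 10, "husk": 0},
        "B": {"pulp": 0, "husk": 10},
    }

    # Indivisible item: the whole orange (pulp and husk together) goes to one sister.
    best_indivisible = max(sum(values[s].values()) for s in ("A", "B"))

    # Divisible item: each part goes to whoever values it the most.
    best_divisible = sum(max(values[s][part] for s in ("A", "B")) for part in ("pulp", "husk"))

    print("Best joint value with an indivisible orange:", best_indivisible)  # 10
    print("Best joint value with a divisible orange:  ", best_divisible)     # 20

Under these assumed valuations, allowing the item to be split doubles the joint value, which is exactly the kind of integrative solution the current formulation cannot represent.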
Incorporating LLMs: Our analysis in Chapter 7 reveals interesting insights about the capabilities of LLMs in mixed-motive contexts. As a note of caution, these insights should be interpreted only within the scope of the prompting variations that we experimented with. Although we designed the prompts based on careful analysis and the best practices from prior work, other, more sophisticated prompt engineering methods could potentially lead to different results. In general, claims about 'what the models can do' are far easier to make than 'what the models cannot do'. Hence, while our goal was to cover the breadth of capabilities based on standard prompting techniques, we encourage future work to rigorously investigate the impact of prompt engineering on the subset of tasks where LLMs perform particularly poorly in our evaluations.

One area where we find that LLMs especially struggle is generating strategically sound responses. We see two primary future directions here. The first is to improve the strategic reasoning of LLMs themselves, for instance, through more sophisticated prompting methods and structured planning that break down the multi-hop task of response generation into a sequence of more tractable steps. This is an active area of research due to its broad applications across a range of tasks in NLP (Wei et al., 2022b; Zhou et al., 2022), and we expect these efforts to also aid the incorporation of LLMs in mixed-motive interactions. The second is to build on top of what has already been shown to work well in negotiation dialogue research in NLP. As discussed in Chapter 8, prior work has explored modular dialogue systems (He et al., 2018) that rely on dialogue-act-based RL policies (instead of the utterance-level policies used in Chapter 6). To handle natural language utterances, these approaches rely heavily on parsing for language understanding and on templates for final response generation, which requires substantial manual effort and does not scale. Instead, we recommend exploring whether LLMs can act as natural language interfaces for precisely controlled RL policies. Effectively, such an approach would leverage LLMs for the simpler subtasks of utterance annotation and dialogue-act-conditioned response generation, while the core policy of the agent is still handled by trained RL models.
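The sketch below illustrates this division of labor: a policy (here, a trivially scripted stand-in for a trained RL model) chooses the next dialogue act, while the LLM is only asked to annotate the partner's utterance with an act and to verbalize the chosen act as a fluent response. The call_llm function, the act schema, and the prompts are illustrative placeholders and assumptions, not the exact design used in our experiments.

    # Sketch of using an LLM purely as a natural language interface around a
    # dialogue-act-level policy. call_llm is a placeholder for any LLM backend.

    ACT_SCHEMA = "offer(food, water, firewood) | accept | reject | small-talk"

    def call_llm(prompt: str) -> str:
        """Placeholder for an LLM API call or a local model. It returns canned strings
        here so the sketch runs without an external service; swap in a real backend."""
        if prompt.rstrip().endswith("Act:"):
            return "offer(1, 2, 1)"
        return "How about I keep two food packages and you take the extra water?"

    def annotate_partner_utterance(utterance):
        """LLM as annotator: map the partner's free-form text to a dialogue act."""
        prompt = (
            f"Label the negotiation message with one act from: {ACT_SCHEMA}.\n"
            f"Message: {utterance}\nAct:"
        )
        return call_llm(prompt).strip()

    def verbalize_act(act, history):
        """LLM as generator: realize the act chosen by the policy as a fluent reply."""
        prompt = (
            "You are negotiating over camping supplies. Write a short, polite reply that "
            f"expresses exactly this act: {act}.\nDialogue so far:\n"
            + "\n".join(history) + "\nYour reply:"
        )
        return call_llm(prompt).strip()

    def policy_next_act(partner_act):
        """Stand-in for a trained RL policy over dialogue acts (here, a trivial rule)."""
        return "offer(2, 1, 0)" if partner_act.startswith("offer") else "small-talk"

    def agent_turn(partner_utterance, history):
        partner_act = annotate_partner_utterance(partner_utterance)  # understanding via the LLM
        next_act = policy_next_act(partner_act)                      # decision via the (RL) policy
        return verbalize_act(next_act, history)                      # generation via the LLM

    history = ["Partner: Hi! Can I take two waters and one food package?"]
    print(agent_turn("Hi! Can I take two waters and one food package?", history))

Keeping the decision-making at the level of dialogue acts retains the controllability of modular systems, while the LLM replaces hand-written parsers and response templates.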
Another challenge for incorporating LLMs is dealing with hallucinations. These models have been shown to hallucinate, that is, to confidently produce false information that is unsupported by the input context or existing facts, across several generation scenarios, including dialogue, text summarization, and question answering (Ji et al., 2023). Researchers have proposed several ways to reduce hallucinations (Shuster et al., 2021; Zhao et al., 2023; Dhuliawala et al., 2023). However, we argue that this problem demands a much deeper investigation. Depending on the context, hallucination may or may not be desirable. For the task of knowledge-grounded dialogue (Hongru et al., 2023), where automated systems are expected to be informative by generating responses conditioned on input knowledge documents, it is essential to design models that limit hallucination and remain faithful to the provided evidence. For the task of story generation (Xie et al., 2023), in contrast, it can be useful for the model to be creative and come up with novel fictional outputs. In a typical mixed-motive interaction, depending on the scenario, hallucination might be acceptable when the model constructs arguments to convince its partner and secure favorable deals, but it is still undesirable to generate responses that are not aligned with the initial preferences of the system. Hence, merely reducing hallucinations is not enough to deal with this problem. Instead, as also argued by Professor Dan Roth at the Southern California NLP Symposium (SoCal NLP 2023, https://socalnlp.github.io/symp23/index.html), it is necessary to get a handle on hallucination such that it can be controlled based on the task requirements. We encourage future work to investigate this direction.

Theory Integration: Most existing modeling efforts are based on crowdsourced datasets. Since crowdsourcing workers may not exhibit optimal strategies, supervised training on these datasets is fundamentally insufficient to build an effective system for applications like pedagogy (teaching social skills to students), regardless of how the system strategy and partner model are designed. Further, using RL to optimize for objective rewards is also not expected to be enough to reliably learn complex social influence capabilities, especially when the reward design is restrictive. To address this, we recommend tapping into the vast amount of relevant research in the social sciences and psychology (Cialdini and Goldstein, 2004; Cameron, 2009; Giles, 2016; Lewicki et al., 2016). Instead of relying solely on the collected data, future work should consider leveraging fundamentals from this research to guide the dialogue policy. Previous works have studied resistance strategies (Petty and Cacioppo, 1977; Knowles and Linn, 2004; Dal Cin et al., 2004; Ahluwalia, 2000). Rucker et al. (2004) found that people resist persuasion differently depending on their beliefs, suggesting that personalization can be beneficial. One can also employ politeness theory (Brown and Levinson, 1978) and model the participants' face acts to better understand users (Dutt et al., 2020).

Evaluation: A key limitation of existing work is the lack of a comprehensive evaluation. Prior work has mainly focused on objective metrics, which provide only a limited view of a model's performance. A comprehensive evaluation is challenging since it must consider partner perception along with objective outcomes. Building user simulators could potentially alleviate this problem (Li et al., 2016; Jain et al., 2018; Shi et al., 2019). Most existing simulators are developed for task-oriented systems, which follow a predefined agenda. Future research should study how to use partner modeling to build user simulators suitable for mixed-motive tasks for more efficient and accurate evaluation (He et al., 2018; Yang et al., 2021). For instance, one could design different user personalities and simulate the change in the user's beliefs, opinions, and attitudes accordingly (Yang et al., 2021).
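As a minimal starting point for such simulators, the sketch below defines a rule-based simulated user for a CaSiNo-style multi-issue negotiation whose persona consists of a priority order and an agreeableness threshold. The point values, persona parameters, and decision rule are illustrative assumptions, not a validated simulator.

    # Toy rule-based user simulator for a CaSiNo-style negotiation. The persona is a
    # priority order over issues plus an "agreeableness" threshold; both are assumptions.
    ISSUES = ["food", "water", "firewood"]
    POINTS = {"high": 5, "medium": 4, "low": 3}   # points per item at each priority level
    N_ITEMS = 3                                   # items available per issue

    class SimulatedUser:
        def __init__(self, priorities, agreeableness=0.6):
            self.priorities = priorities          # issue -> "high" / "medium" / "low"
            self.agreeableness = agreeableness    # minimum fraction of the max score to accept
            self.max_score = sum(POINTS[p] * N_ITEMS for p in priorities.values())

        def score(self, allocation):
            """Points the simulated user earns from an allocation: issue -> items it receives."""
            return sum(POINTS[self.priorities[i]] * n for i, n in allocation.items())

        def respond(self, offer):
            """offer: issue -> items the system keeps; the simulated user gets the remainder."""
            my_share = {i: N_ITEMS - offer.get(i, 0) for i in ISSUES}
            if self.score(my_share) >= self.agreeableness * self.max_score:
                return "accept", None
            # Otherwise counter-offer: demand all items of the top-priority issue, split the rest.
            top = max(ISSUES, key=lambda i: POINTS[self.priorities[i]])
            counter = {i: (0 if i == top else N_ITEMS // 2) for i in ISSUES}  # items the system keeps
            return "counter", counter

    # A stubborn, water-focused persona responding to a fairly greedy system offer.
    user = SimulatedUser({"water": "high", "food": "medium", "firewood": "low"}, agreeableness=0.8)
    decision, counter = user.respond({"food": 2, "water": 2, "firewood": 1})
    print(decision, counter)  # rejects and counters, since the offer leaves it only 15 of 36 points

Richer simulators could replace the fixed threshold with beliefs and attitudes that shift over the course of the dialogue, which is precisely the direction suggested above.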
Studying Cross-Cultural Effects: Our results are primarily based on datasets that contain dialogues in English. Consequently, our human evaluations also involve workers from a restricted demographic pool (location: USA; language: English). However, prior research has noted differences in mixed-motive behaviors across cultures (Luo, 2008; Andersen et al., 2018). Hence, it is unclear whether our findings directly extend to workers from a different demographic, and such cross-cultural effects should be better explored in the future.

Knowledge-Enriched Models: Mixed-motive tasks often involve constantly changing world knowledge, such as facts about organizations and recent news. Often, the system's internal state (e.g., a change of task setting from one set of products to another) also needs to be updated. Retraining the entire system after the initial development is costly. Consequently, some work has proposed augmenting the dialogue system with internet search to generate more factual and up-to-date responses in open-domain dialogues (Komeili et al., 2022). Future efforts in this direction will benefit dialogue systems for mixed-motive tasks as well.

Multimodality and Embodied Agents: Being a core function of human communication, realistic mixed-motive interactions involve not just text but all possible modalities. Schulman and Bickmore (2009) showed that embodied agents achieve better persuasion results than text-only agents. Other studies have recognized the importance of emotion (Asai et al., 2020; Chawla et al., 2021a). Nguyen et al. (2021) introduced a speech dataset of competitive debates and studied the influence of spoken tactics on persuasiveness across genders. Given these findings, we encourage interdisciplinary efforts to explore multimodality in the context of mixed-motive tasks across diverse domains.

A Unified Outlook: One challenge that inhibits progress has been a dearth of large-scale datasets for model training. Mixed-motive tasks are complex for crowdsourcing workers to understand and participate in. Hence, prior work used extensive instructions and tutorials, making data collection expensive and time-consuming (Wang et al., 2019; Chawla et al., 2021b). To address this, we recommend that researchers aim for a more unified view of the efforts in this space. First, this would encourage researchers to adopt best practices from other mixed-motive domains. For instance, most datasets omit user attributes like demographics and personality, which have been argued to be crucial (Stuhlmacher and Walters, 1999; Bogaert et al., 2008). Most datasets also ignore the partner's perception after the interaction is over. This can result in misleading conclusions about model performance, where models perform well objectively but hurt the relationship with their partners, thus negatively impacting practical utility (Aydoğan et al., 2020). Second, a holistic outlook will promote transfer learning and domain adaptation. Establishing cross-domain performance will help to explore fundamental similarities and differences between seemingly different domains. In addition, the taxonomy for datasets presented in Chapter 8 governs the way systems must be modeled and trained: the task structure determines whether the model can learn from the utterances of all players or just one, and the context definition guides how that context must be encoded. Hence, one interesting future direction is joint training on datasets with similar structure and context definitions.
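One lightweight way to operationalize such joint training, sketched below, is to cast examples from structurally similar corpora into a shared text-to-text format with a task tag so that a single sequence-to-sequence model can be trained on their union. The tags and field names are hypothetical and would need to be mapped onto the actual corpora.

    # Sketch: cast examples from two structurally similar mixed-motive corpora into a
    # shared text-to-text format for joint training. Tags and field names are hypothetical.
    def to_text2text(task_tag, context, history, target):
        """Flatten one dialogue example into a (source, target) pair for a seq2seq model."""
        source = f"<task={task_tag}> <context> {context} <history> " + " <turn> ".join(history)
        return {"source": source, "target": target}

    negotiation_example = to_text2text(
        task_tag="negotiation",
        context="priorities: water=high, food=medium, firewood=low",
        history=["Hi! I really need extra water for our hike.",
                 "I can give you two waters if I get two food packages."],
        target="Deal, as long as I also keep one firewood.",
    )

    persuasion_example = to_text2text(
        task_tag="persuasion",
        context="goal: encourage a donation to a children's charity",
        history=["Have you heard about Save the Children?"],
        target="I have not. What do they do?",
    )

    # Joint training simply mixes the two pools; the task tag tells the model which
    # outcome and strategy conventions apply to each example.
    joint_training_set = [negotiation_example, persuasion_example]
    print(joint_training_set[0]["source"])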
Progress in task-oriented and open-domain systems can inspire more unified modeling for mixed-motive tasks involving multiple skills in the same interaction (e.g., a combination of negotiation and persuasion tactics, as is common in realistic scenarios). Roller et al. (2021) blend various open-domain tasks to address multiple challenges together (e.g., persona-based, knowledge-enriched, etc.). Hosseini-Asl et al. (2020) concatenate structured and unstructured data in task-oriented dialogues and cast the different aspects of task-oriented dialogue systems as sequence generation. Future work should explore similar unified approaches for mixed-motive settings as well, especially since these tasks follow a common conceptual foundation (Figure 8.1), with similar evaluation and theoretical principles (Cialdini, 2009). To encourage this unified view, we encapsulate our insights from the survey effort (Chapter 8) in a theoretical framework, which is presented in Appendix F.3. The framework covers key components for designing a dialogue system in this space: defining the system attributes, the target audience, the underlying modeling techniques, and the evaluation mechanisms.

Finally, we recommend leveraging the *ACL workshop series on Social Influence in Conversations (SICon) to guide future progress. SICon provides an excellent venue for bringing researchers from diverse disciplines together, sharing best practices, and exploring novel collaborations. We encourage organizers to promote shared tasks that particularly explore transfer learning techniques in this space.

Chapter 10
Ethical Considerations

We now discuss several ethical considerations that were taken into account for the work presented in this dissertation. This includes information about our data collection procedure and its public release, the precautions taken for experiments that involved human subjects, and details regarding the external datasets and LLMs used in our experiments. We conclude by providing general recommendations for ethically sound progress on human-machine mixed-motive dialogue in the future.

CaSiNo Data Collection: The data collection study (presented in Chapter 3) was approved by our Institutional Review Board (IRB). Each participant signed an Informed Consent document at the beginning of the study, which covered the purpose of the study, warned about potential discomfort, and noted the collection of data and its later use. Further, the participants were informed that they could withdraw at any time. They were also instructed not to use any offensive or discriminatory language. The compensation was determined in accordance with the fairness rules defined by our IRB approval process.

Additionally, we released an anonymized version of the data for future work by the research community. All personally identifiable information, such as MTurk IDs or HIT IDs, was removed before releasing the data. We used a combination of manual and automated filters to remove instances with any offensive language or dialogues that deviated significantly from the designed negotiation task. Lastly, we note that any mention of the demographics or the psychological personality of the participants is based on self-identified information in our pre-survey and standard procedures for collecting personality measures in the literature.

Human Subjects: Multiple studies presented in this dissertation involved human subjects. Apart from the CaSiNo data collection study, this included curating strategy annotations (Chapter 3), comparing partner modeling performance to a human expert (Chapter 5), evaluating mixed-motive dialogue systems (Chapter 6), and evaluating the response generation performance of LLMs (Chapter 7). The CaSiNo data was collected on Amazon Mechanical Turk, while the evaluation of mixed-motive dialogue systems was carried out on the Prolific platform. For the latter, we followed similar compensation and consent procedures as described above.
The rest of the evaluations were carried out by the authors of the respective papers themselves and hence did not require any external crowdsourcing effort.

Datasets Used: All the datasets used for the experiments presented throughout the manuscript had been completely anonymized before their release by the respective authors. We conducted a meticulous review of the licensing details for each dataset to ensure that our usage strictly adheres to their intended purposes and scope. We further note that the datasets used are restricted to the English language; however, the techniques developed make no assumptions about the underlying language and should be broadly applicable to other settings as well.

LLMs Used: Our analysis with Large Language Models (LLMs) is strictly within the intended scope in accordance with the respective licensing details of the released models. Our approach is consistent with various other recent efforts that aim to evaluate the diverse capabilities of LLMs, ensuring that the use remains within ethical and operational guidelines.

Recommendations: Throughout the history of the field, ethics related to mixed-motive research have received significant attention in the literature. Primary concerns include acts of emotion manipulation, deception, bias, and misrepresentation (Lewicki et al., 2016). Naturally, these ethical issues can also creep into automated systems based on human-human dialogue datasets designed for mixed-motive tasks, and several such systems have already been developed, with many exploring widely accessible LLMs for strategic mixed-motive applications. In addition, the commonly studied mixed-motive task formulations focus on scenarios where social influence aspects appear intentionally by design. However, social influence can also be unintentional; that is, interactions between machines and humans can lead to unintended changes in the attitudes, behaviors, or feelings of the human user (Gass, 2015). Examples include changes in topic preferences after interacting with an LLM on a variety of topics, or the adoption of biases after interacting with a biased system. In this sense, these ethical concerns are not limited to systems designed specifically for mixed-motive tasks but extend to any human-facing AI technology. Consequently, as we continue to make unprecedented progress in AI systems that interact with human users via natural means of communication, discussions and regulations about the associated ethics are both timely and necessary.

To mitigate possible misuse scenarios or unintended harms, we now lay out our recommendations. First and foremost, in line with the discussion above, we encourage researchers to get a better handle on the social influence implications of systems deployed at scale. Second, rigorous attempts must be made to ensure that the data collection, design processes, and evaluations strictly abide by the guidelines and regulations laid out by an appropriate IRB or an equivalent committee. Third, the research team needs to develop a thorough plan to monitor and understand the behaviors of the developed systems before deployment. This includes identifying the goals of the system, potential toxic language use, and any discriminatory behaviors. Fourth, investment in improved data collection practices, along with explainable and controllable systems, would help to identify issues early on and provide ways to avoid them. Fifth, we argue that transparency is key.
All stakeholders must be made aware of the goals and design objectives of the system, along with any known misbehaviors or potential risks. The users must also be informed of any data collected during the deployment phase, and its future purpose should be properly laid out. Sixth, we believe that continuous monitoring of dialogue systems is necessary to ensure that the system performs consistently and does not diverge into unexpected conditions that may result in offensive or discriminatory behavior. Finally, we encourage utilizing forums like the SICon workshop for principled discussions on this topic, such as through dedicated panel sessions.

The fast pace of AI research has made the development of formal ethical guidelines difficult. This makes responsible actions on the part of the research community all the more crucial. The inclusion of a focused section on ethical considerations in research papers and a separate ethics reviewing committee in *ACL conferences are both steps in the right direction. However, this is clearly not sufficient, since it is nontrivial to foresee all the underlying ethical concerns or misbehaviors of the systems beforehand. Hence, in addition to following software deployment best practices such as shadow deployment and canary releases, promoting an open-source culture to the extent possible and incentivizing rigorous red-teaming efforts can help mitigate these concerns.

Bibliography

Sahar Abdelnabi, Amr Gomaa, Sarath Sivaprasad, Lea Schönherr, and Mario Fritz. 2023. LLM-Deliberation: Evaluating LLMs with interactive multi-agent negotiation games. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Sibel Adalı and Jennifer Golbeck. 2014. Predicting personality with social behavior: a comparative study. Social Network Analysis and Mining, 4(1):1–20. J Stacy Adams. 1976. The structure and dynamics of behavior in organizational boundary roles. Handbook of industrial and organizational psychology, 1175:1199. Rohini Ahluwalia. 2000. Examination of psychological processes underlying resistance to persuasion. Journal of Consumer Research, 27(2):217–232. Stefano V Albrecht and Peter Stone. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95. Tim Althoff, Kevin Clark, and Jure Leskovec. 2016. Large-scale analysis of counseling conversations: An application of natural language processing to mental health. Transactions of the Association for Computational Linguistics, 4:463–476. Tim Althoff, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2014. How to ask for a favor: A case study on the success of altruistic requests. In Proceedings of the International AAAI Conference on Web and Social Media, volume 8, pages 12–21. Steffen Andersen, Seda Ertac, Uri Gneezy, John A List, and Sandra Maximiano. 2018. On the cultural basis of gender differences in negotiation. Experimental Economics, 21:757–778. Sara Asai, Koichiro Yoshino, Seitaro Shinagawa, Sakriani Sakti, and Satoshi Nakamura. 2020. Emotional speech corpus for persuasive dialogue system. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 491–497. Nicholas Asher, Julie Hunter, Mathieu Morey, Benamara Farah, and Stergos Afantenos. 2016. Discourse structure and dialogue acts in multiparty dialogue: the stac corpus.
In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 2721–2727. 156 Robert J Aumann. 1975. Values of markets with a continuum of traders. Econometrica: Journal of the Econometric Society, pages 611–646. Robert Axelrod. 1980. Effective choice in the prisoner’s dilemma. Journal of conflict resolution, 24(1):3–25. Reyhan Aydoğan, Tim Baarslag, Katsuhide Fujita, Johnathan Mell, Jonathan Gratch, Dave de Jonge, Yasser Mohammad, Shinji Nakadai, Satoshi Morinaga, Hirotaka Osawa, et al. 2020. Challenges and main results of the automated negotiating agents competition (anac) 2019. In Multi-Agent Systems and Agreement Technologies, pages 366–381. Springer. Tim Baarslag, Mark Hendrikx, Koen Hindriks, and Catholijn Jonker. 2013. Predicting the performance of opponent models in automated negotiation. In 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 2, pages 59–66. IEEE. Tim Baarslag, Mark JC Hendrikx, Koen V Hindriks, and Catholijn M Jonker. 2016. A survey of opponent modeling techniques in automated negotiation. In 15th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2016, pages 575–576. International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS). Linda Babcock and Sara Laschever. 2009. Women don’t ask: Negotiation and the gender divide. Princeton University Press. Vevake Balaraman, Seyedmostafa Sheikhalishahi, and Bernardo Magnini. 2021. Recent neural methods on dialogue state tracking for task-oriented dialogue systems: A survey. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 239–251. Carrie Beam and Arie Segev. 1997. Automated negotiations: A survey of the state of the art. Wirtschaftsinformatik, 39(3):263–268. Jim Blascovich, Jack Loomis, Andrew C Beall, Kimberly R Swinth, Crystal L Hoyt, and Jeremy N Bailenson. 2002. Immersive virtual environment technology as a methodological tool for social psychology. Psychological inquiry, 13(2):103–124. Sandy Bogaert, Christophe Boone, and Carolyn Declerck. 2008. Social value orientation and cooperation in social dilemmas: A review and conceptual model. British Journal of Social Psychology, 47(3):453–480. Maria Boritchev and Maxime Amblard. 2021. Ding–a corpus of transcriptions of real-life, oral, spontaneous multi-party dialogues between french-speaking players of catan. In Journées du GdR LIFT. Penelope Brown and Stephen C Levinson. 1978. Universals in language usage: Politeness phenomena. In Questions and politeness: Strategies in social interaction, pages 56–311. Cambridge University Press. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901. 157 Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Ben Goodrich, Daniel Duckworth, Semih Yavuz, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4516–4525. Kenzie A Cameron. 2009. A practitioner’s guide to persuasion: An overview of 15 selected persuasion theories, models and frameworks. 
Patient education and counseling, 74(3):309–317. Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. 2018. Emergent communication through negotiation. In International Conference on Learning Representations. Peter J Carnevale and Dean G Pruitt. 1992. Negotiation and mediation. Annual review of psychology, 43(1):531–582. Shelly Chaiken. 1987. The heuristic model of persuasion. In Social influence: the ontario symposium, volume 5, pages 3–39. Kushal Chawla, Rene Clever, Jaysa Ramirez, Gale Lucas, and Jonathan Gratch. 2021a. Towards emotion-aware agents for negotiation dialogues. In 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–8. IEEE. Kushal Chawla, Rene Clever, Jaysa Ramirez, Gale M Lucas, and Jonathan Gratch. 2023a. Towards emotion-aware agents for improved user satisfaction and partner perception in negotiation dialogues. IEEE Transactions on Affective Computing. Kushal Chawla and Gale Lucas. 2020. Pilot: Winner of the human-agent negotiation challenge at ijcai 2020. arXiv preprint arXiv:2009.06781. Kushal Chawla, Gale Lucas, Jonathan May, and Jonathan Gratch. 2020. Exploring early prediction of buyer-seller negotiation outcomes. arXiv preprint arXiv:2004.02363. Kushal Chawla, Gale Lucas, Jonathan May, and Jonathan Gratch. 2022. Opponent modeling in negotiation dialogues by related data adaptation. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 661–674, Seattle, United States. Association for Computational Linguistics. Kushal Chawla, Jaysa Ramirez, Rene Clever, Gale Lucas, Jonathan May, and Jonathan Gratch. 2021b. Casino: A corpus of campsite negotiation dialogues for automatic negotiation systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Kushal Chawla, Weiyan Shi, Jingwen Zhang, Gale Lucas, Zhou Yu, and Jonathan Gratch. 2023b. Social influence dialogue systems: A survey of datasets and models for social influence tasks. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 750–766. 158 Kushal Chawla, Ian Wu, Yu Rong, Gale Lucas, and Jonathan Gratch. 2023c. Be selfish, but wisely: Investigating the impact of agent personality in mixed-motive human-agent interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13078–13092. Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. Acm Sigkdd Explorations Newsletter, 19(2):25–35. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics. 
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. R. B. Cialdini. 2009. Influence: Science and Practice. fifth ed. Pearson/Allyn & Bacon, Boston, MA. Robert B Cialdini and Noah J Goldstein. 2004. Social influence: Compliance and conformity. Annual review of psychology, 55(1):591–621. Stephen W Clopton. 1984. Seller and buying firm factors affecting industrial buyers’ negotiation behavior and outcomes. Journal of Marketing Research, 21(1):39–53. Gert Cornelissen, Siegfried Dewitte, and Luk Warlop. 2009. Social value orientation as a moral intuition: Decision-making in the dictator game. ACR Asia-Pacific Advances. Rachel G Curtis, Tim D Windsor, and Andrea Soubelet. 2015. The relationship between big-5 personality traits and cognitive ability in older adults–a review. Aging, Neuropsychology, and Cognition, 22(1):42–71. Cristian D.-N.-M., Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. In Proceedings of the 21st international conference on World Wide Web, pages 699–708. Sonya Dal Cin, Mark P Zanna, and Geoffrey T Fong. 2004. Narrative persuasion and overcoming resistance. Resistance and persuasion, 2:175–191. Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pages 76–87. 159 Celso M de Melo, Peter Carnevale, and Jonathan Gratch. 2011. The effect of expression of anger and happiness in computer agents on negotiations with humans. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3, pages 937–944. Orianna Demasi, Marti A Hearst, and Benjamin Recht. 2019. Towards augmenting crisis counselor training by improving message retrieval. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pages 1–11. Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. Goemotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054. Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2021. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 54(1):755–810. David DeVault, Johnathan Mell, and Jonathan Gratch. 2015. Toward natural turn-taking in a virtual human negotiation agent. In 2015 AAAI Spring Symposium Series. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. 
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495. Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander H Miller, Arthur Szlam, and Jason Weston. 2016. Evaluating prerequisite qualities for learning end-to-end dialog systems. In 4th International Conference on Learning Representations, ICLR 2016. Yuxiao Dong, Yang Yang, Jie Tang, Yang Yang, and Nitesh V Chawla. 2014. Inferring user demographics and social strategies in mobile social networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 15–24. Sebastian Duerr and Peter A Gloor. 2021. Persuasive natural language generation–a literature review. arXiv preprint arXiv:2101.05786. Esin Durmus and Claire Cardie. 2019. A corpus for modeling user and language effects in argumentation on online debating. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Ritam Dutt, Rishabh Joshi, and Carolyn Rose. 2020. Keeping up appearances: Computational modeling of face acts in persuasion oriented discussions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7473–7485. 160 Subhabrata Dutta, Jeevesh Juneja, Dipankar Das, and Tanmoy Chakraborty. 2022. Can unsupervised knowledge transfer from social discussions help argument mining? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7774–7786, Dublin, Ireland. Association for Computational Linguistics. Ryo Egawa, Gaku Morio, and Katsuhide Fujita. 2019. Annotating and analyzing semantic role of elementary units and relations in online persuasive arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 422–428. Mark G Ehrhart, Karen Holcombe Ehrhart, Scott C Roesch, Beth G Chung-Herrera, Kristy Nadler, and Kelsey Bradshaw. 2009. Testing the latent factor structure and construct validity of the ten-item personality inventory. Personality and individual Differences, 47(8):900–905. Layla El Asri, Hannes Schulz, Shikhar Kr Sarma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: a corpus for adding memory to goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 207–219. Beverley Fehr, Deborah Samsom, and Delroy L Paulhus. 2013. The construct of machiavellianism: Twenty years later. Advances in personality assessment, pages 77–116. Ernst Fehr and Klaus M Schmidt. 1999. A theory of fairness, competition, and cooperation. The Quarterly Journal of Economics, 114(3):817–868. Chaim Fershtman. 1990. The importance of the agenda in bargaining. Games and Economic Behavior, 2(3):224–238. Roger Fisher, William L Ury, and Bruce Patton. 2011. Getting to yes: Negotiating agreement without giving in. Penguin. Tommaso Fornaciari and Massimo Poesio. 2012. Decour: a corpus of deceptive statements in italian courts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 1585–1590. Jamie Fraser, Ioannis Papaioannou, and Oliver Lemon. 2018. Spoken conversational ai in video games: Emotional dialogue management increases user engagement. 
In Proceedings of the 18th International Conference on Intelligent Virtual Agents, pages 179–184. Chris Frith and Uta Frith. 2005. Theory of mind. Current biology, 15(17):R644–R645. Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. Kanishk Gandhi, Dorsa Sadigh, and Noah D. Goodman. 2023. Strategic reasoning with language models. Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational ai. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374. 161 Robert H Gass. 2015. Social influence, sociology of. In International Encyclopedia of the Social and Behavioral Sciences, pages 348–354. Elsevier. Howard Giles. 2016. Communication accommodation theory: Negotiating personal relationships and social identities across contexts. Cambridge University Press. Lewis R Goldberg. 1990. An alternative" description of personality": the big-five factor structure. Journal of personality and social psychology, 59(6):1216. William D Grampp. 2000. What did smith mean by the invisible hand? Journal of Political Economy, 108(3):441–465. Jonathan Gratch, David DeVault, Gale M Lucas, and Stacy Marsella. 2015. Negotiation as a challenge problem for virtual humans. In International Conference on Intelligent Virtual Agents, pages 201–215. Springer. HP Grice. 1975. Logic and conversation. Syntax and Semantics, 3:43–58. Chris Guthrie and Dan Orr. 2006. Anchoring, information, expertise, and negotiation: New insights from meta-analysis. Ohio State Journal on Dispute Resolution, 21:597. Rafik Hadfi, Jawad Haqbeen, Sofia Sahab, and Takayuki Ito. 2021. Argumentative conversational agents for online discussions. Journal of Systems Science and Systems Engineering, 30(4):450–464. He He, Derek Chen, Anusha Balakrishnan, and Percy Liang. 2018. Decoupling strategy and generation in negotiation dialogues. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2333–2343. Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272. WANG Hongru, Minda Hu, Yang Deng, Rui Wang, Fei Mi, Weichao Wang, Yasheng Wang, WaiChung Kwan, Irwin King, and Kam-Fai Wong. 2023. Large language models as source planner for personalized knowledge-grounded dialogues. In The 2023 Conference on Empirical Methods in Natural Language Processing. Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue. Advances in Neural Information Processing Systems, 33:20179–20191. Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in building intelligent opendomain dialog systems. ACM Transactions on Information Systems (TOIS), 38(3):1–32. Alankar Jain, Florian Pecune, Yoichi Matsuyama, and Justine Cassell. 2018. A user simulator architecture for socially-aware conversational agents. In Proceedings of the 18th International Conference on Intelligent Virtual Agents, pages 133–140. Jiun-Hao Jhan, Chao-Peng Liu, Shyh-Kang Jeng, and Hung-Yi Lee. 2021. Cheerbots: Chatbots toward empathy and emotionusing reinforcement learning. arXiv preprint arXiv:2110.03949. 162 Changzhen Ji, Xin Zhou, Yating Zhang, Xiaozhong Liu, Changlong Sun, Conghui Zhu, and Tiejun Zhao. 2020. 
Cross copy network for dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1900–1910. Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Emmanuel Johnson, Gale Lucas, Peter Kim, and Jonathan Gratch. 2019a. Intelligent tutoring system for negotiation skills training. In International Conference on Artificial Intelligence in Education, pages 122–127. Springer. Emmanuel Johnson, Sarah Roediger, Gale Lucas, and Jonathan Gratch. 2019b. Assessing common errors students make when negotiating. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, pages 30–37. Rishabh Joshi, Vidhisha Balachandran, Shikhar Vashishth, Alan Black, and Yulia Tsvetkov. 2020. Dialograph: Incorporating interpretable strategy-graph networks into negotiation dialogues. In International Conference on Learning Representations. Harold H Kelley. 1996. A classroom study of the dilemmas in interpersonal negotiations. Berkeley Institute of International Studies. Anthony Kelly. 2003. Decision making using game theory: an introduction for managers. Cambridge University Press. Sung Soo Kim, Stan Kaplowitz, and Mark V Johnston. 2004. The effects of physician empathy on patient satisfaction and compliance. Evaluation & the health professions, 27(3):237–251. Olga M Klimecki. 2019. The role of empathy and compassion in conflict resolution. Emotion Review, 11(4):310–325. Eric S Knowles and Jay A Linn. 2004. Resistance and persuasion. Psychology Press. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213. Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-augmented dialogue generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460–8478. Klaus Krippendorff. 2004. Reliability in content analysis: Some common misconceptions and recommendations. Human communication research, 30(3):411–433. 163 Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. Sage publications. Adarsh Kumar, Peter Ku, Anuj Goyal, Angeliki Metallinou, and Dilek Hakkani-Tur. 2020. Ma-dst: Multi-attention-based scalable dialog state tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8107–8114. Deuksin Kwon, Emily Weiss, Tara Kulshrestha, Kushal Chawla, Gale M Lucas, and Jonathan Gratch. 2024. Are llms effective negotiators? systematic evaluation of the multifaceted capabilities of llms in negotiation dialogues. arXiv preprint arXiv:2402.13550. Yi-Chieh Lee, Naomi Yamashita, and Yun Huang. 2020. Designing a chatbot as a mediator for promoting deep self-disclosure to a real mental health professional. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1):1–27. Yaniv Leviathan and Yossi Matias. 2018. 
Google duplex: An ai system for accomplishing realworld tasks over the phone. URL https://ai.googleblog.com/2018/05/duplex-ai-system-for-naturalconversation.html, 3. Roy J Lewicki, Bruce Barry, and David M Saunders. 2016. Essentials of negotiation. McGraw-Hill New York. Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? end-to-end learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, page 2443. Association for Computational Linguistics. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474. Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. Advances in neural information processing systems, 31. Xinmeng Li, Wansen Wu, Long Qin, and Quanjun Yin. 2021. How to evaluate your dialogue models: A review of approaches. arXiv preprint arXiv:2108.01369. Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688. Yu Li, Kun Qian, Weiyan Shi, and Zhou Yu. 2020. End-to-end trainable non-collaborative dialog system. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8293–8302. Kai-Hui Liang, Patrick L Lange, Yoo Jung Oh, Jingwen Zhang, Yoshimi Fukuoka, and Zhou Yu. 2021. Evaluation of in-person counseling strategies to develop physical activity chatbot for women. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 32–44. 164 Wang Liao, Jingwen Zhang, Yoo Jung Oh, and Nicholas A Palomares. 2021. Linguistic accommodation enhances compliance to charity donation: The role of interpersonal communication processes in mediated compliance-gaining conversations. Journal of Computer-Mediated Communication, 26(3):167–185. Raz Lin, Sarit Kraus, Tim Baarslag, Dmytro Tykhonov, Koen Hindriks, and Catholijn M. Jonker. 2014. Genius: An integrated environment for supporting the design of generic automated negotiators. Computational Intelligence, 30(1):48–70. Houjun Liu. 2021. Towards automated psychotherapy via language modeling. arXiv preprint arXiv:2104.10661. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Gale Lucas, Giota Stratou, Shari Lieblich, and Jonathan Gratch. 2016. Trust me: multimodal signals of trustworthiness. In Proceedings of the 18th ACM international conference on multimodal interaction, pages 5–12. Peng Luo. 2008. Analysis of cultural differences between west and east in international business negotiation. International Journal of Business and Management, 3(11):103–106. Aman Madaan, Katherine Hermann, and Amir Yazdanbakhsh. 2023. What makes chain-of-thought prompting effective? a counterfactual study. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1448–1535. James Manyika and Sissie Hsiao. 2023. An overview of bard: an early experiment with generative ai. AI. 
Google Static Documents, 2. Shikib Mehri and Maxine Eskenazi. 2020. Unsupervised evaluation of interactive dialog with dialogpt. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–235. Johnathan Mell and Jonathan Gratch. 2016. Iago: interactive arbitration guide online. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1510– 1512. Johnathan Mell and Jonathan Gratch. 2017. Grumpy & pinocchio: answering human-agent negotiation questions through realistic agent design. In Proceedings of the 16th conference on autonomous agents and multiagent systems, pages 401–409. Johnathan Mell, Jonathan Gratch, Reyhan Aydoğan, Tim Baarslag, and Catholijn M Jonker. 2019. The likeability-success tradeoff: Results of the 2 nd annual human-agent automated negotiating agents competition. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 1–7. IEEE. 165 Johnathan Mell, Jonathan Gratch, Tim Baarslag, Reyhan Aydoğan, and Catholijn M Jonker. 2018a. Results of the first annual human-agent league of the automated negotiating agents competition. In Proceedings of the 18th International Conference on Intelligent Virtual Agents, pages 23–28. Johnathan Mell, Gale Lucas, and Jonathan Gratch. 2015. An effective conversation tactic for creating value over repeated negotiations. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1567–1576. Johnathan Mell, Gale Lucas, Sharon Mozgai, Jill Boberg, Ron Artstein, and Jonathan Gratch. 2018b. Towards a repeated negotiating agent that treats people individually: cooperation, social value orientation, & machiavellianism. In Proceedings of the 18th international conference on intelligent virtual agents, pages 125–132. Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. Parlai: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84. John Milnor and Lloyd S Shapley. 1957. On games of survival. Contributions to the Theory of Games, 3:15–45. Shannon Monahan, Emmanuel Johnson, Gale Lucas, James Finch, and Jonathan Gratch. 2018. Autonomous agent that provides automated feedback improves negotiation skills. In International Conference on Artificial Intelligence in Education, pages 225–229. Springer. Burt L Monroe, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin’words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403. Ryan O Murphy and Kurt A Ackermann. 2014. Social value orientation: Theoretical and measurement issues in the study of social preferences. Personality and Social Psychology Review, 18(1):13–41. Janice Nadler. 2004. Rapport in negotiation and conflict resolution. Marquette Law Review, 87:875. John F Nash Jr. 1950. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162. Zahra Nazari, Gale M Lucas, and Jonathan Gratch. 2015. Opponent modeling for virtual human negotiators. In International Conference on Intelligent Virtual Agents, pages 39–49. Springer. Huyen Nguyen, Ralph Vente, David Lupea, Sarah Ita Levitan, and Julia Hirschberg. 2021. Acousticprosodic, lexical and demographic cues to persuasiveness in competitive debate speeches. Proc. Interspeech 2021, pages 1034–1038. 
Jinjie Ni, Tom Young, Vlad Pandelea, Fuzhao Xue, and Erik Cambria. 2023. Recent advances in deep learning based dialogue systems: A systematic survey. Artificial intelligence review, 56(4):3055–3155. 166 Tim Norfolk, Kamal Birdi, and Deirdre Walsh. 2007. The role of empathy in establishing rapport in the consultation: a new model. Medical education, 41(7):690–697. Richard L Oliver, PV Sundar Balakrishnan, and Bruce Barry. 1994. Outcome satisfaction in negotiation: A test of expectancy disconfirmation. Organizational Behavior and Human Decision Processes, 60(2):252–275. OpenAI. 2022. Openai: Introducing chatgpt. OpenAI. 2023. Gpt-4 technical report. Alvaro Ortigosa, José Ignacio Quiroga, and Rosa M Carro. 2011. Inferring user personality in social networks: A case study in facebook. In 2011 11th International Conference on Intelligent Systems Design and Applications, pages 563–568. IEEE. Siru Ouyang, Zhuosheng Zhang, and Hai Zhao. 2021. Dialogue graph modeling for conversational machine reading. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3158–3169. James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001. Denis Peskov, Benny Cheng, Ahmed Elgohary, Joe Barrow, Cristian Danescu-Niculescu-Mizil, and Jordan Boyd-Graber. 2020. It takes two to lie: One to lie, and one to listen. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3811–3854. Richard E Petty and John T Cacioppo. 1977. Forewarning, cognitive responding, and resistance to persuasion. Journal of Personality and social Psychology, 35(9):645. Robin L Pinkley, Margaret A Neale, and Rebecca J Bennett. 1994. The impact of alternatives to settlement in dyadic negotiation. Organizational Behavior and Human Decision Processes, 57(1):97–116. Pooja Prajod, Mohammed Al Owayyed, Tim Rietveld, Jaap-Jan van der Steeg, and Joost Broekens. 2019. The effect of virtual agent warmth on human-agent negotiation. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 71–76. Briony D Pulford, Eva M Krockow, Andrew M Colman, and Catherine L Lawrence. 2016. Social value induction and cooperation in the centipede game. PloS one, 11(3):e0152352. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, Jeffrey Dean, and Sanjay Ghemawat. 2019. Language models are unsupervised multitask learners. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, pages 137–150. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67. 167 Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381. Lisa Rashotte. 2007. Social influence. The Blackwell encyclopedia of sociology. Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, et al. 2021. Recipes for building an open-domain chatbot. 
In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325. AW Rosenthal and Ariel Rubinstein. 1984. Repeated two-player games with ruin. International Journal of Game Theory, 13:155–177. Derek D Rucker, Zakary L Tormala, and Richard E Petty. 2004. Individual differences in resistance to persuasion: The role of beliefs and meta-beliefs. Resistance and persuasion, page 83. Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. Carer: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697. Thomas C Schelling. 1958. The strategy of conflict. prospectus for a reorientation of game theory. Journal of Conflict Resolution, 2(3):203–264. Daniel Schulman and Timothy Bickmore. 2009. Persuading users through counseling dialogue with a conversational agent. In Proceedings of the 4th international conference on persuasive technology, pages 1–8. James K Sebenius. 2017. Batna s in negotiation: Common errors and three kinds of “no”. Negotiation Journal, 33(2):89–99. Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871. Weiyan Shi, Kun Qian, Xuewei Wang, and Zhou Yu. 2019. How to build user simulators to train rl-based dialog systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1990–2000. Zhouxing Shi and Minlie Huang. 2019. A deep sequential model for discourse parsing on multiparty dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7007–7014. Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803. 168 Manjira Sinha and Tirthankar Dasgupta. 2021. Predicting success of a persuasion through joint modeling of utterance categorization. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3423–3427. Giota Stratou, Rens Hoegen, Gale Lucas, and Jonathan Gratch. 2015. Emotional signaling in a social dilemma: An automatic analysis. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pages 180–186. IEEE. Alice F Stuhlmacher and Amy E Walters. 1999. Gender differences in negotiation outcome: A meta-analysis. Personnel Psychology, 52(3):653–677. Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai, and Yi Zhang. 2022. Multi-task pre-training for plug-and-play task-oriented dialogue system. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4661–4676. Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune bert for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer. Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th international conference on world wide web, pages 613–624. 
Michael Tanana, Kevin A Hallgren, Zac E Imel, David C Atkins, and Vivek Srikumar. 2016. A comparison of natural language processing methods for automated coding of motivational interviewing. Journal of substance abuse treatment, 65:43–50. Jianheng Tang, Tiancheng Zhao, Chenyan Xiong, Xiaodan Liang, Eric Xing, and Zhiting Hu. 2019. Target-guided open-domain conversation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5624–5634. Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805. Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: determining support or opposition from congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 327–335. Amos Tversky and Daniel Kahneman. 1981. The framing of decisions and the psychology of choice. science, 211(4481):453–458. Tomer Ullman. 2023. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv preprint arXiv:2302.08399. Paul AM Van Lange, Ellen De Bruin, Wilma Otten, and Jeffrey A Joireman. 1997. Development of prosocial, individualistic, and competitive orientations: theory and preliminary evidence. Journal of personality and social psychology, 73(4):733. 169 Paul AM Van Lange, David De Cremer, Eric Van Dijk, and Mark van Vugt. 2007. 23. self-interest and beyond: basic processes of social interaction. In Social psychology: handbook of basic principles, pages 540–564. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008. John Von Neumann and Oskar Morgenstern. 2007. Theory of games and economic behavior (60th Anniversary Commemorative Edition). Princeton university press. Hongru Wang, Lingzhi Wang, Yiming Du, Liang Chen, Jingyan Zhou, Yufei Wang, and Kam-Fai Wong. 2023. A survey of the evolution of language model-based dialogue systems. arXiv preprint arXiv:2311.16789. Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. 2019. Persuasion for good: Towards a personalized persuasive dialogue system for social good. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5635–5649. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837. Joseph Weizenbaum. 1966. Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45. Colin R Williams, Valentin Robu, Enrico H Gerding, and Nicholas R Jennings. 2012. Iamhaggler: A negotiation agent for complex environments. In New Trends in Agent-based Complex Automated Negotiations, pages 151–158. Springer. Ronald J Williams. 1992. 
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256. Qingyang Wu, Yichi Zhang, Yu Li, and Zhou Yu. 2021. Alternating recurrent dialog model with large-scale pre-trained language models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1292–1301. William Wunderle. 2007. How to negotiate in the middle east. Military review, 87(2):33. Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. 2024. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333. Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. 2024. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36. 170 Zhuohan Xie, Trevor Cohn, and Jey Han Lau. 2023. The next chapter: A study of large language models in storytelling. In Proceedings of the 16th International Natural Language Generation Conference, pages 323–351. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244. Yuyu Xu, Pedro Sequeira, and Stacy Marsella. 2017. Towards modeling agent negotiators by analyzing human negotiation behavior. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 58–64. IEEE. Toshio Yamagishi, Yoshie Matsumoto, Toko Kiyonari, Haruto Takagishi, Yang Li, Ryota Kanai, and Masamichi Sakagami. 2017. Response time in economic games reflects different types of decision conflict for prosocial and proself individuals. Proceedings of the National Academy of Sciences, 114(24):6394–6399. Atsuki Yamaguchi, Kosui Iwasa, and Katsuhide Fujita. 2021. Dialogue act-based breakdown detection in negotiation dialogues. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 745–757. Zhao Yan, Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and Zhoujun Li. 2017. Building task-oriented dialogue systems for online shopping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31. Runzhe Yang, Jingxiao Chen, and Karthik Narasimhan. 2021. Improving dialog systems for negotiation with personality modeling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 681–693. Denis Yarats and Mike Lewis. 2018. Hierarchical text generation and planning for strategic dialogue. In International Conference on Machine Learning, pages 5591–5599. PMLR. Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. 2021. Pretrained transformers for text ranking: Bert and beyond. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 1154–1156. Min-Hsuan Yeh and Lun-Wei Ku. 2021. Lying through one’s teeth: A study on verbal leakage cues. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4504–4510. Dian Yu, Yulia Tyshchuk, Heng Ji, and William Wallace. 2015. Detecting deceptive groups using conversations and network analysis. 
In 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL-IJCNLP 2015, pages 857–866. Association for Computational Linguistics (ACL). 171 Tao Yu, Rui Zhang, He Yang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, et al. 2019. Cosql: A conversational text-to-sql challenge towards cross-domain natural language interfaces to databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Giorgos Zacharia and Pattie Maes. 2000. Trust management through reputation mechanisms. Applied Artificial Intelligence, 14(9):881–907. Jingwen Zhang, Yoo Jung Oh, Patrick Lange, Zhou Yu, Yoshimi Fukuoka, et al. 2020a. Artificial intelligence chatbot behavior change model for designing artificial intelligence chatbots to promote physical activity and a healthy diet. Journal of medical Internet research, 22(9):e22845. Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213. Zheng Zhang, Lizi Liao, Xiaoyan Zhu, Tat-Seng Chua, Zitao Liu, Yan Huang, and Minlie Huang. 2020b. Learning goal-oriented dialogue policy with opposite agent awareness. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 122–132. Zheng Zhang, Ryuichi Takanobu, Qi Zhu, MinLie Huang, and XiaoYan Zhu. 2020c. Recent advances and challenges in task-oriented dialog systems. Science China Technological Sciences, pages 1–17. Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing. 2023. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5823–5840. Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations. Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V Le, Ed H Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. 2024. Self-discover: Large language models self-compose reasoning structures. arXiv preprint arXiv:2402.03620. Yiheng Zhou, He He, Alan W Black, and Yulia Tsvetkov. 2019a. A dynamic strategy coach for effective negotiation. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 367–378. Yiheng Zhou, Yulia Tsvetkov, Alan W Black, and Zhou Yu. 2019b. Augmenting non-collaborative dialog systems with explicit semantic and strategic dialog history. In International Conference on Learning Representations. 172 Appendix A The CaSiNo Dataset A.1 Pre-Survey After an internal pilot with 9 participants, the entire CaSiNo dataset was collected on Amazon Mechanical Turk over a period of a month. In total, 846 subjects took part in our data collection study. 
The statistics presented in this section are based on self-identified demographical attributes and standard ways of collecting personality traits from the literature. We had a highly diverse participant pool, representing different age groups, genders, ethnic backgrounds, and education levels. The mean Age among our participants is 36.97 with a standard deviation of 10.81. One participant was removed from this computation since the age entered was 3, which we believed to be in error. Among the participants, 472 identified themselves as Female, 372 were Male, and 2 belonged to Other category. While most of the participants were White American (625 in count), our study also involved a mix of Asian American, Black or African American, Hispanic or Latino, and Multi-Racial groups, among others. The most common highest level of education was found to be a 4-year Bachelor’s degree (346 participants), although the complete pool represents a mixture 173 of Master and Ph.D. degree holders, 2-year and 4-year college graduates without degrees, and high school graduates, among others. For the personality traits, 364 participants were classified as Proself, 463 as Prosocial, and 19 were unclassified based on their Social Value Orientation1 . The mean scores for the Big5 personality traits were found to be as follows: Agreeableness: 5.27, Conscientiousness: 5.6, Emotional Stability: 4.91, Extraversion: 3.69, Openness to Experiences: 5.04. We use the Ten-Item Personality Inventory (TIPI)2 to compute these attributes, where each of them takes a value between 1 and 7. A.2 Preparation Phase We present the scenario description seen by the participants in Table A.1. Several arguments that the participants come up with are presented in Table A.2. Imagine that you are on a camping trip! Woohoo! Apart from some basic amount of supplies that are provided to everyone, you can collect some additional food packages, water bottles, and firewood, to make your camping trip even better. Since these are limited in quantity, you will have to split these additional packages with your campsite neighbor! Each item will be of either High, Medium, or Low priority for you. Each of them only has an available quantity of 3. You will negotiate with another MTurker by chatting in English, using arguments from your personal experiences to justify why you need additional packages apart from the basic supplies. Try hard to get as many items as you can! Table A.1: The camping scenario description as seen by the participants in our data collection. 1https://static1.squarespace.com/static/523f28fce4b0f99c83f055f2/t/ 56c794cdf8baf3ae17cf188c/1455920333224/Triple+Dominance+Measure+of+SVO.pdf 2https://gosling.psy.utexas.edu/scales-weve-developed/ten-item-personality-measure-tipi/ ten-item-personality-inventory-tipi/ 174 Category Item type Food Water Firewood Personal Care because I’m normally eat more because of my big size I have to take a lot of medicine so hydration is very important I have arthritis and being sure I am warm is important for my comfort. Recreational Need many snacks throughout the day for energy to hike I am a very active camper. I like to hike when I camp and I once ran out of water during a strenuous hike. I like having campfires so I need all the firewood. Group Needs I have two teenage boys who require a lot of food, especially when expending so much energy with all the activities of camping. I need more water because I have more people to keep hydrated and do not have enough. 
I need more firewood due to having several people join on the trip and needing a bigger fire overall. Emergency Some could have been damaged during the trip. I would need more. our car overheated we had to use the water It may get cold and firewood can be hard to come by at certain campsites. Table A.2: Example arguments that the participants come up with for their individual requirements during the preparation phase. The categories defined are not exhaustive. A.3 Data Post-Processing steps We list the data post-processing and filtering steps below: 1. Removal of incomplete dialogues: During the data collection, many negotiation sessions could not be completed due to one of the participants’ disconnecting in the middle. Any dialogue for which we had missing data, including pre-survey and post-survey responses for both the participants, was removed from the final dataset. 2. Removal of bad quality dialogues: We also removed dialogues where we observed a lack of effort or an irrelevant dialogue between the participants. We removed dialogues where the participants used very short utterances or failed to answer the dummy questions about their own preferences correctly, suggesting a lack of effort. Further, we removed the instances where the participants talked about the MTurk task itself, rather than the negotiation. These cases were identified based on a list of keywords: {‘mturk’, ‘amt’, ‘turns’, 175 ‘messages’, ‘amazon’, ‘10’}. In a few cases, it was possible to retain the complete dialogue structure by just removing a few utterances. Hence, in these cases, we only removed the irrelevant utterances, while retaining the rest of the dialogue and the associated metadata. 3. Tackling inappropriate language use: Rarely, some participants also used inappropriate language in their utterances. These dialogues were identified using the lexicon of English swear words on Wikipedia3 . All these dialogues were also removed from the final dataset. A.4 Participant Feedback Role-playing has been a key technique to teach negotiation skills in classroom settings. One of the key application areas for automated negotiation systems is to augment such exercises by allowing human participants to negotiate with an AI and practice their social skills. To maximize the utility of the system developed using our dataset, we chose the camping scenario, which we expected to be easily relatable for our participants and also for any individual who negotiates with a system developed on our dataset. This is essential to ensure that the collected dialogues are engaging, interesting, and capture the rich personal context of the individuals, albeit in a closed-domain setting. One way to judge whether the participants are able to relate to the scenario is via their feedback after the study. With this in mind, we used a feedback column in the Post-survey and asked several questions to the participants throughout the data collection process. These questions included: 1) How was your overall experience? 2) Were you able to see yourself in the ‘role’ and follow best practices?, 3) Could you relate to camping?, and 4) How helpful was the preparation phase? 3https://en.wiktionary.org/wiki/Category:English_swear_words 176 Based on manual inspection, we observed overall positive feedback for all the above questions. Most of the participants were able to easily relate to camping. They frequently pointed out that the experience was ‘fun’, ‘interesting’, and ‘nice’. 
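To make the keyword-based screening of Section A.3 above concrete, the following is a minimal sketch of how such a filter could be implemented. The function names, the data layout (a list of dialogues with an "utterances" field), and the swear-word file are illustrative assumptions; the actual pipeline also involved manual checks and, in some cases, removed only the offending utterances rather than the whole dialogue.

TASK_KEYWORDS = {"mturk", "amt", "turns", "messages", "amazon", "10"}

def load_swear_words(path="english_swear_words.txt"):
    # hypothetical file holding the Wiktionary lexicon of English swear words
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_flagged(utterance, swear_words):
    """Flag an utterance that refers to the MTurk task itself or uses inappropriate language."""
    tokens = {tok.strip(".,!?").lower() for tok in utterance.split()}
    return bool(tokens & TASK_KEYWORDS) or bool(tokens & swear_words)

def filter_dialogues(dialogues, swear_words):
    """Keep only dialogues in which no utterance is flagged."""
    return [d for d in dialogues
            if not any(is_flagged(u, swear_words) for u in d["utterances"])]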
Many saw this as an opportunity to talk to someone during these tough times of the pandemic. Several cherry-picked feedback responses which indicate that the participants enjoyed the task as a whole and were in fact able to connect well and engage in the negotiation, have been provided in Table A.3. I could do this all day I am camping right now! My partner had better reasons for needing the firewood I enjoyed talking about camping, I haven’t been in a while. It reminded me of all of the things that I used to do. The best thing I did was ask him what his preferences were. He had no interest in firewood which was my highest priority. Table A.3: A few positive feedback responses that we obtained from the participants during the collection of the CaSiNo dataset. A.5 Correlational Analysis The analysis discussed in Chapter 3 is presented in Tables A.4, A.5, A.6, and A.7. Points-Scored Satisfaction Opp-Likeness Points-Scored 1 .376** .276** Satisfaction .376** 1 .702** Opp-Likeness .276** .702** 1 P.Points-Scored −.092** .105** .132** P.Satisfaction .105** .180** .244** P.Opp-Likeness .132** .244** .344** Table A.4: Pearson Correlation Coefficients (r) between the outcome variables. Variables with P. prefix denote the corresponding attributes of the negotiation partner of an individual. These correlations have been computed on the entire CaSiNo dataset. * denotes significance with < 0.05 (2-tailed). ** denotes significance with < 0.01 (2-tailed). 177 Joint Points Integrative potential .425*** Table A.5: Pearson Correlation Coefficient (r) between integrative potential and the joint negotiation performance. *** denotes significance with < 0.001. Joint Points Points-Scored Satisfaction Opp-Likeness P.Points-Scored P.Satisfaction P.Opp-Likeness Prosocial Generic Small-Talk −.022 −.002 .086* .115** −.025 .068 .127** Prosocial About Preferences No-Need −.003 −.066 .035 .023 .063 .083* .089* Elicit-Pref .053 .055 .058 .015 .010 .022 .055 Proself Generic UV-Part −.037 .008 −.051 −.112** −.054 −.131** −.151** Vouch-Fairness −.140** −.084* −.159** −.196** −.090* −.185** −.180** Proself About Preferences Self-Need −.003 .022 −.061 −.065 −.026 −.091* −.086* Other-Need −.176** −.045 −.101** −.118** −.174** −.160** −.113** Table A.6: Pearson Correlation Coefficients (r) for strategy annotation counts with the outcome variables. Variables with P. prefix denote the corresponding attributes of the negotiation partner of an individual. These correlations have been computed on the annotated subset of the CaSiNo dataset. * denotes significance with < 0.05 (2-tailed). ** denotes significance with < 0.01 (2-tailed). A.6 Strategy Prediction A.6.1 Architecture We provide some more details on the strategy prediction multi-task architecture in this section. The self-attention layer is itself represented using the BERT encoder architecture, but with a single transformer layer and just one attention head. After the self-attention layer, we first extract the 768 dimensional representation for the [CLS] token. This is passed through a feed-forward network, which converts it to 128 dimensions. The feature embedding is also converted to a 128 dimensional vector using a feed-forward network. 
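As a concrete illustration of the fusion described in this paragraph and completed in the next (element-wise summation, feed-forward layers with hidden dimensions 64 and 1, and a sigmoid output), here is a minimal PyTorch sketch of one per-strategy prediction head. Only the dimensions come from the text; the class name, the feature dimension, and the ReLU nonlinearity are illustrative assumptions. In the multi-task setup, one such head would sit on top of the shared encoder for each annotated strategy.

import torch
import torch.nn as nn

class StrategyHead(nn.Module):
    """Illustrative head that fuses the 768-d [CLS] representation with the
    feature embedding and outputs the probability of one annotation strategy."""
    def __init__(self, cls_dim=768, feat_dim=16, hidden=128):
        super().__init__()
        self.cls_proj = nn.Linear(cls_dim, hidden)    # [CLS] vector: 768 -> 128
        self.feat_proj = nn.Linear(feat_dim, hidden)  # feature embedding -> 128 (feat_dim is a placeholder)
        self.scorer = nn.Sequential(
            nn.Linear(hidden, 64),   # feed-forward layers with hidden dimensions 64 and 1
            nn.ReLU(),               # nonlinearity assumed; not specified in the text
            nn.Linear(64, 1),
        )

    def forward(self, cls_vec, feats):
        fused = self.cls_proj(cls_vec) + self.feat_proj(feats)  # element-wise summation
        return torch.sigmoid(self.scorer(fused)).squeeze(-1)    # probability for this strategy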
Both the above embeddings are then combined using an element-wise summation, which further passes through two feedforward layers with 178 P.Small-Talk P.Self-Need P.Other-Need P.No-Need P.Elicit-Pref P.UV-Part P.Vouch-Fair Small-Talk .769** −.033 .021 .063 −.059 −.012 −.180** Self-Need −.033 .355** .103** .115** −.007 .235** −.088* Other-Need .021 .103** .339** .002 −.067 .159** −.015 No-Need .063 .115** .002 .258** .097** .064 −.116** Elicit-Pref −.059 −.007 −.067 .097** .168** −.097** −.102** UV-Part −.012 .235** .159** .064 −.097** .268** .064 Vouch-Fair −.180** −.088* −.015 −.116** −.102** .064 .287** Table A.7: Pearson Correlation Coefficients (r) between strategy annotation counts. Variables with P. prefix denote the corresponding attributes of the negotiation partner of an individual. These correlations have been computed on the annotated subset of the CaSiNo dataset. * denotes significance with < 0.05 (2-tailed). ** denotes significance with < 0.01 (2-tailed). hidden dimensions of 64 and 1, and a sigmoid layer to finally output the probability for each annotation strategy. A.6.2 Computing Infrastructure All experiments were performed on a single Nvidia Tesla V100 GPU. The training takes two hours to complete for a single model on all the cross-validation folds. A.6.3 Training Details To search for the best hyperparameters, we use a combination of randomized and manual search for the Full model. For each cross-fold, 5% of the training data was kept aside for validation. The metric for choosing the best hyper-parameters is the mean F1 score for the positive class on the validation dataset. The mean is over all the labels and over 5 cross-validation folds. We vary the learning rate in {3 −5 , 4 −5 , 5 −5 }, weight decay in {0.0, 0.01, 0.001} and dropout in {0.0, 0.1, 0.2, 0.3}. The rest of the hyper-parameters were fixed based on the available computational and space resources. We report the best-performing hyper-parameters in Chapter 3, which 179 were used for all the experiments. We report the performance on the validation set corresponding to the chosen hyper-parameters and the number of trainable parameters in Table A.8. Model Overall Validation F1 Trainable Parameters Majority 0.0 0 LR-BoW 49.6 2646.2 (27.2) BERT-FT 69.9 109, 590, 529 Multi-task training Freeze 62.3 221, 361, 031 No Attn 66.6 110, 235, 271 No Feats 77.6 330, 840, 583 Full 78.1 330, 844, 807 +OS 77.9 330, 844, 807 +IDPT 79.6 330, 844, 807 +IDPT+OS 79.6 330, 844, 807 Table A.8: Training details for the strategy prediction task. The Overall F1 scores are for the positive class. For LR-BoW, the exact number of features varies slightly based on the CV split. Hence, we report Mean (Std) across the five splits. A.7 Screenshots from the Data Collection Interface To provide more clarity on the data collection procedure, we present several screenshots from our interface in Figures A.1, A.2, A.3, and A.4. We design the pre-survey using the Qualtrics platform4 . The rest of the data collection is based on the ParlAI framework (Miller et al., 2017). Figure A.1: Screenshots from the data collection interface: Task Preview. This is a brief task description that the MTurkers see before signing up for our data collection task. 4https://www.qualtrics.com/core-xm/survey-software/ 180 (a) Onboarding Phase 1: The first step takes the participant to Qualtrics which collects the demographics, introduces the camping scenario, and gives a tutorial on negotiation best practices. 
(b) Onboarding Phase 2: In this phase, we explicitly ask the participants to come up with arguments from their past experiences, which justify their preferences. The preference order is randomly assigned by us. This provides a personal context around the negotiation for each participant. Figure A.2: Screenshots from the data collection interface: Participant On-boarding. 181 (a) Chat Interface: The right portion allows two participants to negotiate in English using alternating messages. They also have the option to use emoticons. Once they come to an agreement, one of the participants must enter the exact deal on the left. (b) Response to the Deal: When one of the participants enters the deal, the other gets an option to either accept, reject, or walk away from the deal. In the CaSiNO dataset, a participant walks away in 36 dialogues. Figure A.3: Screenshots from the data collection interface: Chat Interface. 182 Figure A.4: Screenshots from the data collection interface: Post-Survey. Once the deal is accepted (or someone walks away), both the participants are asked to fill in the post-survey having the above questions. The figure contains dummy responses. 183 Appendix B A Case for Emotion-Aware Mixed-Motive Systems B.1 Correlation Among Emotion Dimensions Table B.1 presents the correlations of emotion measures among each other. In general, we find that the observed trends follow our intuition, with positive correlations between dimensions with similar valences, and negative correlations between dimensions with opposite valences. B.2 Lexical Correlates of Emotion We present the top 5 words in each emotion category in Table B.2 along with scaled Z-scores. As evident, most of the word associations with respective emotion dimensions make sense intuitively, further validating our methods for emotion recognition. B.3 Controlling for Objective Performance Tables B.3 and B.4 summarize the results for predicting satisfaction and liking respectively, after controlling for the objective negotiation performance. As discussed in Chapter 4, we continue 184 T5-Twitter T5-Twitter Joy Sadness Anger Joy 1 −.287** −.295** Sadness −.287** 1 .278** Anger −.295** .278** 1 (a) T5-Reddit T5-Reddit Joy Sadness Anger Joy 1 −.100** −.053* Sadness −.100** 1 .210** Anger −.053* .210** 1 (b) T5-Twitter T5-Reddit Joy Sadness Anger Joy .501** −.212** −.087** Sadness −.027 .721** .224** Anger .061** .268** .569** (c) T5-Twitter Emoticon LIWC Joy Sadness Anger Positive Emotions Sadness Anger Joy .061** −.043 −.040 .387** −.126** −.106** Sadness −.009 .213** .086** −.034 .552** .033 Anger −.052* .113** .111** −.185** .114** .169** (d) T5-Reddit Emoticon LIWC Joy Sadness Anger Positive Emotions Sadness Anger Joy .100** -.035 .014 .376** -.034 -.017 Sadness -.020 .202** .061** -.014 .597** .051* Anger -.057** .138** .190** -.118** .052* .118** (e) Table B.1: Correlations of emotion variables among each other. (a) T5-Twitter variables among each other, (b) T5-Reddit variables among each other, (c) T5-Twitter variables against T5-Reddit variables, (d) T5-Twitter variables against Emoticon and LIWC variables, (e) T5-Reddit variables against Emoticon and LIWC variables. We only consider the dimensions that are common across all methods. * denotes < 0.05. ** denotes < 0.01. to see significant improvements with the incorporation of emotion variables in this case as well, attesting to their utility in negotiation dialogues. 
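The regressions behind Tables B.3 and B.4 add predictor blocks one at a time and test the resulting R-squared change. A minimal sketch of one such step is shown below, assuming a pandas dataframe whose column names (points scored, individual differences, affect variables) are illustrative; it is not the exact analysis script used for these tables.

import statsmodels.api as sm
from scipy import stats

def r2_change_test(df, outcome, base_cols, added_cols):
    """Fit nested OLS models and F-test the R^2 change from adding a predictor block."""
    y = df[outcome]
    base = sm.OLS(y, sm.add_constant(df[base_cols])).fit()
    full = sm.OLS(y, sm.add_constant(df[base_cols + added_cols])).fit()
    r2_change = full.rsquared - base.rsquared
    num_df = len(added_cols)            # predictors added in this block
    den_df = full.df_resid              # residual df of the larger model
    f_change = (r2_change / num_df) / ((1 - full.rsquared) / den_df)
    p_change = stats.f.sf(f_change, num_df, den_df)
    return full.rsquared, r2_change, f_change, p_change

# Example: adding participant affect on top of points scored and individual differences
# r2, dr2, f, p = r2_change_test(df, "satisfaction", ["points_scored"] + demo_cols, affect_cols)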
185 Joy Sadness Anger Surprise Word Z-score Word Z-score Word Z-score Word Z-score hope 2.89** sorry 2.10* animal 1.36 drive 1.43 buddy 2.72** spilled 1.84 unacceptable 0.98 bargain 1.43 awesome 2.71** n’t 1.82 covid-19 0.84 awful 1.32 great 2.60** reconsider 1.38 word 0.84 function 1.32 thank 2.53* stopped 1.30 joking 0.84 snack 1.23 (a) Positive Emotions Sadness Anger Anxiety Word Z-score Word Z-score Word Z-score Word Z-score good 11.89*** sorry 9.93*** hate 2.08* worried 7.18*** well 11.87*** low 6.16*** greedy 1.58 afraid 4.07*** okay 9.90*** alone 4.43*** critical 1.35 worry 3.5*** great 9.86*** sugar 2.44* tricky 1.19 risk 3.22** ok 9.68*** hear 2.43* bother 1.06 confused 2.54* (b) Joy Sadness Anger Surprise Fear Love Word Z-score Word Z-score Word Z-score Word Z-score Word Z-score Word Z-score you 8.62*** sorry 10.63*** cold 11.5*** funny 1.93 worried 8.72*** hot 6.4*** good 7.68*** unfortunately 4.68*** no 5.76*** interesting 1.89 afraid 7.05*** generous 3.04** great 7.33*** dehydrated 4.03*** need 4.77*** surprise 1.08 concerned 4.48*** hotter 1.89 hello 6.84*** low 3.12*** thirsty 4.63*** ha 1.03 dark 3.88*** lovely 1.56 sounds 6.81*** suffer 2.98** hungry 4.36*** status 0.85 scared 3.58*** liking 1.34 (c) Joy Sadness Anger Surprise Fear Disgust Word Z-score Word Z-score Word Z-score Word Z-score Word Z-score Word Z-score great 9.27*** sorry 13.42*** no 6.51*** how 16.86*** worried 7.14*** awful 0.92 good 7.9*** unfortunately 5.03*** not 5.19*** what 14.88*** afraid 5.77*** contaminated 0.7 sounds 7.47*** hear 3.75*** unfair 3.18** you 12.96*** scared 3.58*** worse 0.39 thank 6.49*** unfortunate 2.5* walk 3.09** about 9.97*** terrible 2.96** settle 0.34 yes 6.09*** bad 2.3* away 2.48* are 7.63*** dark 2.68** leaving 0.29 (d) Table B.2: Top 5 words based on the log-odds ratio for (a) Emoticon (b) LIWC, (c) T5-Twitter, and (d) T5-Reddit methods. * denotes < 0.05. ** denotes < 0.01. *** denotes < 0.001. 186 Variables R 2 df F R 2 Change F Change Emoticons Points Scored .146 (1, 2010) 343.88*** − − +Individual Differences .172 (15, 1996) 27.70*** .026 4.52*** +Participant Affect .191 (19, 1992) 24.76*** .019 11.51*** +Partner Affect .199 (23, 1988) 21.52*** .008 5.19*** LIWC Points Scored .146 (1, 2010) 343.88*** − − +Individual Differences .172 (15, 1996) 27.70*** .026 4.52*** +Participant Affect .178 (19, 1992) 22.77*** .006 3.71** +Partner Affect .191 (23, 1988) 20.43*** .013 7.82*** T5-Twitter Points Scored .146 (1, 2010) 343.88*** − − +Individual Differences .172 (15, 1996) 27.70*** .026 4.52*** +Participant Affect .223 (21, 1990) 27.24*** .051 21.75*** +Partner Affect .241 (27, 1984) 23.34*** .018 7.77*** T5-Reddit Points Scored .146 (1, 2010) 343.88*** − − +Individual Differences .172 (15, 1996) 27.70*** .026 4.52*** +Participant Affect .225 (21, 1990) 27.59*** .053 22.77*** +Partner Affect .245 (27, 1984) 23.78*** .019 8.32*** Table B.3: Regression results for predicting Satisfaction after controlling for the points scored by the participants. ** denotes < 0.01. *** denotes < 0.001. 
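Table B.2 ranks words by a log-odds ratio with scaled Z-scores. One common formulation of such an analysis, the log-odds ratio smoothed by an informative prior, is sketched below; the text does not state which exact variant was used, so this is only illustrative, and the corpus split (utterances scoring high versus low on an emotion dimension) is an assumption.

import numpy as np
from collections import Counter

def log_odds_z(counts_a, counts_b, prior):
    """Z-scored log-odds of word use in corpus A vs. corpus B, smoothed by a prior."""
    n_a, n_b, n_p = sum(counts_a.values()), sum(counts_b.values()), sum(prior.values())
    scores = {}
    for w in set(counts_a) | set(counts_b):
        a, b, p = counts_a[w], counts_b[w], prior[w]
        delta = (np.log((a + p) / (n_a + n_p - a - p))
                 - np.log((b + p) / (n_b + n_p - b - p)))
        var = 1.0 / (a + p) + 1.0 / (b + p)
        scores[w] = delta / np.sqrt(var)
    return scores

# counts_a, counts_b: Counters over tokens from high- vs. low-emotion utterances;
# prior: a Counter over the full corpus (so every word has a nonzero prior count).
# Sorting by score gives Table B.2-style word lists.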
187 Variables R 2 df F R 2 Change F Change Emoticons Points Scored .077 (1, 2010) 167.38*** − − +Individual Differences .121 (15, 1996) 18.39*** .045 7.23*** +Participant Affect .152 (19, 1992) 18.82*** .031 18.10*** +Partner Affect .165 (23, 1988) 17.14*** .013 7.88*** LIWC Points Scored .077 (1, 2010) 167.38*** − − +Individual Differences .121 (15, 1996) 18.39*** .045 7.23*** +Participant Affect .133 (19, 1992) 16.05*** .011 6.51*** +Partner Affect .146 (23, 1988) 14.78*** .013 7.71*** T5-Twitter Points Scored .077 (1, 2010) 167.38*** − − +Individual Differences .121 (15, 1996) 18.39*** .045 7.23*** +Participant Affect .216 (21, 1990) 26.18*** .095 40.22*** +Partner Affect .251 (27, 1984) 24.62*** .035 15.25*** T5-Reddit Points Scored .077 (1, 2010) 167.38*** − − +Individual Differences .121 (15, 1996) 18.39*** .045 7.23*** +Participant Affect .240 (21, 1990) 29.86*** .118 51.56*** +Partner Affect .284 (27, 1984) 29.14*** .044 20.47*** Table B.4: Regression results for the prediction of Liking after controlling for the points scored by the participants. *** denotes < 0.001. 188 Appendix C Data Augmentation Improves Partner Modeling C.1 Experimental Design C.1.1 Computing Infrastructure All experiments were performed on a single Tesla V100 GPU. The complete model (CD + CA + DND) takes around 10 hours for training with 32-bit precision on a single cross-validation fold with a batch size of 25. C.1.2 Training Details We used a combination of randomized and manual search to tune the hyperparameters. For each cross fold, we kept 50 dialogues from the CD training data for parameter tuning. This amounts to 100 data points, considering the two perspectives extracted from each dialogue. The metric for choosing the best hyperparameters is EMA at =5, averaged over the 5 cross-validation folds. We tuned the parameters on the performance of the BERT-based model with CD + CA + DND configuration. 189 We vary the learning rate in {1 −5 , 2 −5 , 3 −5 }, dropout in {0.0, 0.1, 0.2}, and loss-specific dropout in {0.0, 0.15, 0.25}. We also varied the number of transformer layers in the Level II encoder from Figure 5.2 in the set {1, 2, 3}. For DND, we also varied the number of instances that were chosen for adaptation but found that using all the instances that passed our filtering gave the best performance. We further varied the margin for ranking loss in {0.0, 0.3, 0.5}. Finally, for the models trained on combined datasets, we tried with a higher weightage (2x) for the loss contribution of CA-adapted instances due to their lower total count but found no visible improvements in the performance. The rest of the hyper-parameters were fixed based on the available computational and space resources. We report the best-performing hyper-parameters in Chapter 5. The models used in Chapter 5 have nearly 171 million trainable parameters. We report the mean performance on the validation set in Table C.1. C.1.3 External Packages and Frameworks The models were developed in PyTorch Lightning1 and relied on the HuggingFace Transformers library2 for using the pretrained models and their corresponding tokenizers. We used a number of external packages such as Python Scikit Learn3 library for implementing the evaluation metrics, and NLTK4 for tokenization for the Bag-of-Words model. 
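For reference, EMA (exact match accuracy) is evaluated here after the model has seen the first k utterances (k = 5 above; the x-axis of Figure C.1 is the number of utterances seen). Assuming, as a simplification, that a prediction counts as correct only when the complete predicted partner priority order matches the ground truth, the metric can be sketched as follows; the data structures are illustrative.

def exact_match_accuracy(predicted_orders, true_orders):
    """Fraction of dialogues where the full predicted priority order matches exactly;
    a partially correct ranking contributes nothing."""
    assert len(predicted_orders) == len(true_orders)
    hits = sum(tuple(p) == tuple(t) for p, t in zip(predicted_orders, true_orders))
    return hits / len(true_orders)

# e.g., orders listed from highest to lowest priority over the three issues:
# exact_match_accuracy([["water", "firewood", "food"]], [["water", "firewood", "food"]])  # -> 1.0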
1https://www.pytorchlightning.ai/ 2https://github.com/huggingface/transformers 3https://scikit-learn.org/stable/modules/model_evaluation.html 4https://www.nltk.org/api/nltk.tokenize.html 190 Model EMA Random 17.8 (4.87) BoW-Ranker 35 (3.35) Bert-based DND 51 (1.67) CA + DND 51.2 (3.12) CD 63.6 (4.84) CD + CA 65.8 (1.94) CD + DND 69 (2.28) CD + CA + DND 70 (2.61) RoBerta-based DND 54.6 (5.43) CA + DND 55 (5.55) CD 70.2 (3.19) CD + CA 70 (3.95) CD + DND 75.6 (2.15) CD + CA + DND 77.8 (2.32) Table C.1: Validation performance for partner modeling on CD dataset. The reported EMA is at =5. The numbers represent Mean (Std.) over 5-cross folds of the CD data. C.2 Regular Expression Usage C.2.1 Adapting DealOrNoDeal data We randomly mapped book from DealOrNoDeal to food, replacing all occurrences of ‘book’ and ‘books’ with ‘food’ in the utterances. Similarly, hat was mapped to water, and ball was mapped to firewood. Since the dialogues only involve minimal context about the issues, we found these replacements to be sufficient. 191 C.2.2 Identifying Offer Statements The offer statements were also recognized by regular expressions for the purpose of computing average attention scores. Specifically, an utterance is classified as having an offer, if it contains 3 or more of the following phrases - {’0’, ’1’, ’2’, ’3’, ’one’, ’two’, ’three’, ’all the’, ’food’, ’water’, ’firewood’, ’i get’, ’you get’, ’what if’, ’i take’, ’you can take’, ’can do’}. The threshold 3 and these phrases were chosen heuristically via qualitative analysis. C.3 Comparison with Human Performance 1 2 3 4 5 Number of Utterances Seen 40 45 50 55 60 65 70 75 Exact Match Accuracy (EMA) Best Performing Model Human Expert Figure C.1: Mean performance comparison for the best-performing model with the human expert for different values of . We present the performance for our best-performing model with the human expert across different values of in Figure C.1. 192 Appendix D Be Selfish, But Wisely: Designing Mixed-Motive Dialogue Models D.1 Problematic Self-Interested Behavior Figure D.1 shows a sample dialogue between two self-interested RL agents based on the standard self-play RL training procedure followed in prior work (Lewis et al., 2017; He et al., 2018). To the best of our knowledge, such an analysis has not been shown in the past. The dialogue provides evidence that the agents fail to learn about the value of compromise in a negotiation. Since both agents are unwilling to make concessions for their partner, they simply get stuck. D.2 Human Evaluation Setup Crowdsourcing platform: We performed our initial experiments on Amazon Mechanical Turk (AMT)1 . However, we faced severe quality issues even with strict worker qualifications and 1https://www.mturk.com/ 193 Context (Alice & Bob: Standard RL Agents) Counts Book = 1, Hat = 2, Ball = 2 Alice Values Book = 2, Hat = 3, Ball = 1 Bob Values Book = 2, Hat = 1, Ball = 3 Dialogue Alice i would like the hats and the balls . Bob i can take the balls but i need the book or both hats Alice i need the book and at least one other item Bob i can not make that deal . i need the book and at least 1 hat or a ball ... ... Bob i can’t do that if i get the book , you can have the rest Turn limit reached Output Alice <no_agreement> Bob <no_agreement> Reward Alice 0/10 Bob 0/10 Figure D.1: A sample negotiation dialogue between two copies of the standard RL agent based on Lewis et al. (2017). The task here is to divide the available books, hats, and balls between the two players. 
In this case, the agents get stuck – both continuously asking what they want without looking for a compromise. extensive checks in place. We eventually collected our entire data on the Prolific crowdsourcing platform2 , where we could receive a relatively much better data quality. Study Design: Our study involved a pre-survey based on Qualtrics3 which included attention checks and a personality test to measure the Social Value Orientation (Van Lange et al., 1997) of the human participants (Prosocial vs Proself)4 . However, in our study, we observed no significant differences among the agents’ performances when interacting with Prosocial or Proself 2https://www.prolific.co/ 3https://www.qualtrics.com/ 4https://static1.squarespace.com/static/523f28fce4b0f99c83f055f2/t/ 56c794cdf8baf3ae17cf188c/1455920333224/Triple+Dominance+Measure+of+SVO.pdf 194 human partners. We also included a mini-tutorial to prepare the participants for their upcoming negotiation with a randomly chosen agent. The main negotiation task was set up using the LIONESS framework5 , which was hosted on AWS6 using a Bitnami LAMP stack7 . We provide a screenshot from the task in Figure D.2. After the negotiation, we used a post-survey to gather the participants’ subjective perceptions. For satisfaction, we asked “How satisfied are you with the negotiation outcome?”, and for likeness, we asked “How much do you like your partner?”. We used a 5-point Likert scale for both questions, from Extremely dissatisfied (dislike) to Extremely satisfied (like). For the statistical analysis presented in Section 6.4, we codified this scale from 1.0 to 5.0, considering both of these metrics as continuous measures. Figure D.2: Screenshot from the human evaluation study. The participants first negotiate with a randomly assigned dialogue agent using the chat feature on the right side. Once an agreement is reached, the participant is asked to enter the agreed deal using the options on the left. The participant was also allowed to walk away from the conversation without agreement. The participant was allowed to submit a deal or walk away after at least one turn. 5https://lioness-lab.org/ 6https://aws.amazon.com/ 7https://bitnami.com/stack/lamp/cloud 195 Worker Qualifications: Each worker was only allowed to participate once in the study. The worker pool was restricted to the USA, with English as the native language, a minimum approval rate of 90%, and at least 500 minimum number of submissions. Worker Payments: The workers were paid at the rate of $12 per hour. The expected time to complete the study was 10 minutes, resulting in a base pay of 2. In addition, the workers were entered into a lottery where we awarded $10 to 15 randomly selected workers. A worker’s chances of winning the lottery depended on their performance and effort put into the task. Post-Processing: For nearly 30 % of the cases, the final deal entered by the human or the agent did not match. However, this disagreement did not mean a disagreement in the negotiation. Instead, this was primarily due to either an error by the model, the human worker, or both (occurs rarely). Hence, we post-processed the data to fix these instances. This was done manually by the authors of the paper (that is, experts knowledgeable about the task). For each instance, either the agreed deal was identified, or the instance was discarded from evaluation if the agreed deal was completely unclear (occurs rarely). Statistics: We summarized the statistics in Table D.1. 
We collected nearly 100 agent-human conversations for every dialogue model. In general, we find the conversations to be longer between humans and selfish agents (and more number of words per utterance from the selfish agents), as compared to the fair ones. This is probably due to the selfish agents negotiating harder for high-value deals for themselves. 196 Model # of dialogues Avg. # of utterances/dialogue Avg. # of words/utterance M p=S r=fair 97 4.35 5.47 M p=fair r=fair 100 4.71 5.50 M p=selfish r=fair 97 4.91 5.31 M p=S r=selfish 96 8.74 6.94 M p=fair r=selfish 90 6.50 6.75 M p=selfish r=selfish 103 6.29 6.00 Table D.1: Statistics from the human evaluation study. D.3 Sample Conversations from Human Evaluation We show sample conversations from the human evaluation for all dialogue agents in Tables D.2 to D.7. Context Counts Book = 3, Hat = 1, Ball = 2 Model Values Book = 2, Hat = 2, Ball = 1 Human Values Book = 0, Hat = 8, Ball = 1 Dialogue Human Hi there. How about you get both the balls, I get 2 books and 1 hat? Model deal Human <dealselection> Output Model Book = 1, Hat = 0, Ball = 2 Human Book = 2, Hat = 1, Ball = 0 Reward Model 4/10 Human 8/10 Table D.2: Sample human evaluation output for =fair =fair . 197 Context Counts Book = 1, Hat = 4, Ball = 2 Model Values Book = 4, Hat = 0, Ball = 3 Human Values Book = 0, Hat = 2, Ball = 1 Dialogue Model i need the book and 2 balls Human Not possible, I need 1 ball and 2 hats Model ok Human Ok, deal. Model <dealselection> Output Model Book = 1, Hat = 2, Ball = 1 Human Book = 0, Hat = 2, Ball = 1 Reward Model 7/10 Human 5/10 Table D.3: Sample human evaluation output for =selfish =fair . Context Counts Book = 2, Hat = 3, Ball = 1 Model Values Book = 0, Hat = 1, Ball = 7 Human Values Book = 2, Hat = 2, Ball = 0 Dialogue Model i would love the ball and the books Human Okay I want all the hats then. Model <dealselection> Output Model Book = 2, Hat = 0, Ball = 1 Human Book = 0, Hat = 3, Ball = 0 Reward Model 7/10 Human 6/10 Table D.4: Sample human evaluation output for = =fair. 198 Context Counts Book = 2, Hat = 2, Ball = 3 Model Values Book = 3, Hat = 2, Ball = 0 Human Values Book = 2, Hat = 0, Ball = 2 Dialogue Model i would like the hats and books . Human I want the two books, you take the rest. Model i will take the balls and books Human sorry, that’s no deal. Model <dealselection> Human <walkaway> Output Model <walkaway> Human <walkaway> Reward Model 0/10 Human 0/10 Table D.5: Sample human evaluation output for = =selfish. Context Counts Book = 1, Hat = 1, Ball = 4 Model Values Book = 8, Hat = 2, Ball = 0 Human Values Book = 5, Hat = 1, Ball = 1 Dialogue Model i would like the book and the hat . Human Hey! You would like one book and the hat? Model i would like the book and the hat . Human I will give you the book, but I’d like the hat and the balls to make sure we both get 5. Model i could use the book and the hat . Human Do you agree? Model <dealselection> Human <walkaway> Output Model <walkaway> Human <walkaway> Reward Model 0/10 Human 0/10 Table D.6: Sample human evaluation output for =fair =selfish. 199 Context Counts Book = 1, Hat = 4, Ball = 2 Model Values Book = 0, Hat = 2, Ball = 1 Human Values Book = 4, Hat = 0, Ball = 3 Dialogue Human hi, what items do you need? Model i want the hats and the balls Human I can give you all the hats only Model i need the hats and the book Human I can only offer the hats Model i can’t do that Human I can give you all the hats and one ball Model i can’t do that . 
i need the hats and at least 1 ball Human yeah i can do that Model <dealselection> Output Model Book = 1, Hat = 0, Ball = 1 Human Book = 0, Hat = 4, Ball = 1 Reward Model 9/10 Human 7/10 Table D.7: Sample human evaluation output for =selfish =selfish . 200 Appendix E Evaluating Large Language Models in Mixed-Motive Contexts E.1 Details of Negotiation Tasks We present an overview of all the designed tasks in Table E.1. Table E.2 summarizes the distribution of tasks across datasets and time stages. In some cases, a specific model failed to follow the provided instructions for a specific task and hence, failed to produce reliable results. Such tasks were entirely excluded (for all models) while computing the average bar plots in Figure 7.3 for a fair comparison. Table E.3 presents the final list of selected tasks used for computing these plots. Finally, we report the task-wise results for all models in Table E.4 for completeness. E.2 Prompting The precise versions used for GPT-3.5 and GPT-4 are ‘gpt-3.5-turbo-1106’ and ‘gpt-4-1106-preview’, respectively. Tables E.5, E.6, and E.7 provide examples of prompts for specific tasks corresponding to the start, during, and end time stages respectively. Additionally, Table E.8 showcases an example 201 Dataset Time Stage Full Task Name Task Type CA Start sta_ask_high_priority_ca Comprehension CA Start sta_ask_low_priority_ca Comprehension CA Start sta_ask_point_values_ca Comprehension CA Start sta_max_points_ca Comprehension CA Start sta_total_item_count_ca Comprehension CA During dur_partner_ask_high_priority_ca Partner Modeling CA During dur_partner_ask_low_priority_ca Partner Modeling CA During dur_strategy_ca Annotation CA During dur_gen_resp_ca Generation CA During dur_ask_high_priority_ca Comprehension CA During dur_ask_low_priority_ca Comprehension CA End end_deal_likeness_ca Comprehension CA End end_deal_satisfaction_ca Comprehension CA End end_deal_specifics_ca Comprehension CA End end_deal_total_ca Comprehension CA End end_partner_deal_likeness_ca Partner Modeling CA End end_partner_deal_satisfaction_ca Partner Modeling CRA During dur_dial_act_cra Annotation CRA During dur_full_proposal_cra Annotation DND Start sta_ask_point_values_dnd Comprehension DND Start sta_max_points_dnd Comprehension DND Start sta_total_item_count_dnd Comprehension DND During dur_dial_act_dnd Annotation DND During dur_full_proposal_dnd Annotation DND During dur_gen_resp_dnd Generation DND End end_deal_specifics_dnd Comprehension DND End end_deal_total_dnd Comprehension JI Start sta_ask_high_priority_ji Comprehension JI Start sta_ask_low_priority_ji Comprehension JI During dur_dial_act_ji Annotation JI During dur_partner_ask_high_priority_ji Comprehension JI During dur_partner_ask_low_priority_ji Comprehension JI During dur_ask_high_priority_ji Comprehension JI During dur_ask_low_priority_ji Comprehension JI End end_deal_specifics_ji Comprehension Table E.1: An overview of the designed tasks for evaluating LLMs in negotiations. CRA: (DeVault et al., 2015), DND: (Lewis et al., 2017), CA: (Chawla et al., 2021b), JI: (Yamaguchi et al., 2021). of CoT prompting used for the end_deal_total_ca task. Table E.9 shows the questions used in prompts for all tasks. 202 Dataset Negotiation Stage Total Start During End CA 5 6 6 17 CRA 2 2 DND 3 3 2 8 JI 2 5 1 8 Total 10 16 9 35 Table E.2: Distribution of the designed tasks by the dataset and time stage. 
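The prompts in Tables E.5 through E.8 combine a task description, context blocks such as <count> and <value>, and a final question that requests the answer inside <answer></answer> tags. A minimal sketch of issuing such a prompt and parsing the response is given below; the helper names are illustrative, and the OpenAI client call reflects the current chat-completions interface rather than the exact evaluation harness used in this work.

import json
import re
from openai import OpenAI

def query_llm(prompt, model="gpt-4-1106-preview"):
    """Send one evaluation prompt to a chat model and return the raw text output."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

def parse_answer(output):
    """Extract the JSON inside <answer></answer>; return None if the format was not followed."""
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None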
Task Types Task Names Comprehension (Start) sta_max_points_ca, sta_max_points_dnd, sta_total_item_count_ca, sta_total_item_count_dnd, sta_ask_high_priority_ji, sta_ask_low_priority_ji Comprehension (End) end_deal_specifics_ca, end_deal_specifics_dnd, end_deal_total_ca, end_deal_total_dnd Comprehension (Subjective) end_deal_satisfaction_ca Annotation (During) dur_dial_act_cra, dur_dial_act_ji, dur_strategy_ca Partner Modeling (During) dur_partner_ask_high_priority_ca, dur_partner_ask_low_priority_ca Partner Modeling (Subjective) end_deal_satisfaction_ca, end_deal_likeness_ca Table E.3: Selected tasks for computing the average bar plots in Figure 7.3. E.3 Baselines We use two baselines in this work: a Majority baseline, that trivially outputs the majority answer from the training data, and a Flan-T5 fine-tuning baseline, where we train a model for each task separately by fine-tuning on the available training data. Task Selection: For all tasks that are defined in the Start time stage, it is possible to achieve a 100% accuracy through trivial rule-based parsers. Hence, we excluded these tasks for evaluation with Majority and Flan-T5. Data Preprocessing: During the data preprocessing for fine-tuning, although there are differences in the information included in the prompts for each task, prompts for evaluating LLMs are generally 203 Full Task Name Metric Model Majority Flan-T5 GPT-3.5 GPT-4 Mistral7b Vicuna13b Vicuna33b Wizard13b end_deal_likeness_ca Accuracy 0.525 0.525 0.357 0.175 0.119 0.267 0.239 end_deal_satisfaction_ca Accuracy 0.5 0.467 0.373 0.417 0.092 0.266 0.216 0.445 end_deal_specifics_ca Accuracy 0.356 0.364 0.664 0.916 0.517 0.517 0.593 0.555 end_deal_total_ca Accuracy 0.142 0.233 0.158 0.083 0.15 0.05 0.017 0.017 end_partner_deal_likeness_ca Accuracy 0.517 0.517 0.31 0.308 0.133 0.167 0.178 0.282 end_partner_deal_satisfaction_ca Accuracy 0.433 0.492 0.426 0.517 0.13 0.271 0.083 0.345 dur_ask_high_priority_ca Accuracy 0.742 0.9 0.558 0.375 0.345 dur_ask_low_priority_ca Accuracy 0.533 0.75 0.358 0.286 0.269 dur_partner_ask_high_priority_ca Accuracy 0.292 0.717 0.7 0.792 0.483 0.42 0.353 0.392 dur_partner_ask_low_priority_ca Accuracy 0.325 0.717 0.517 0.692 0.433 0.306 0.357 0.333 dur_strategy_ca F1 0.055 0.724 0.463 0.507 0.265 0.381 0.304 0.254 sta_ask_high_priority_ca Accuracy 1 1 0.667 sta_ask_low_priority_ca Accuracy 1 1 0.5 0.4 sta_ask_point_values_ca F1 1 1 1 1 1 1 sta_max_points_ca Accuracy 0.333 0.333 0.5 0 0 0 sta_total_item_count_ca Accuracy 1 1 1 1 1 0.333 dur_dial_act_cra F1 0.067 0.787 0.535 0.678 0.35 0.338 0.518 0.302 dur_full_proposal_cra Accuracy 0.359 0.439 0.352 0.369 0.241 0.262 0.245 0.325 end_deal_specifics_dnd Accuracy 0.454 0.973 0.67 0.949 0.558 0.631 0.558 0.628 end_deal_total_dnd Accuracy 0.257 0.832 0.381 0.664 0.23 0.319 0.221 0.336 dur_dial_act_dnd F1 0.888 0.96 0.735 0.825 0.764 0.639 0.337 dur_full_proposal_dnd Accuracy 0.39 1 0.742 0.866 0.648 0.748 0.725 0.687 sta_ask_point_values_dnd Accuracy 0.993 1 1 1 0.752 1 sta_max_points_dnd Accuracy 0.317 0.337 0.366 0.495 0.307 0.386 sta_total_item_count_dnd Accuracy 0.95 1 0.98 0.505 0.901 0.465 end_deal_specifics_ji Accuracy 0.261 0.764 0.782 0.858 0.733 0.8 0.785 0.766 dur_ask_high_priority_ji Accuracy 0.495 0.862 0.37 0.233 0.252 0.259 dur_ask_low_priority_ji Accuracy 0.67 0.917 0.333 0.26 0.306 0.296 dur_dial_act_ji F1 0.058 0.019 0.578 0.688 0.387 0.452 0.468 0.414 dur_partner_ask_high_priority_ji Accuracy 0.165 0.202 0.193 0.198 0.204 0.204 dur_partner_ask_low_priority_ji Accuracy 0.193 0.266 0.202 
0.269 0.176 0.157 0.13 sta_ask_high_priority_ji Accuracy 0.78 0.89 0.505 0.155 0.211 0.596 sta_ask_low_priority_ji Accuracy 0.761 0.972 0.468 0.174 0.202 0.367 Table E.4: Task-wise results for all models. Empty values for Majority and Flan-T5 correspond to the tasks on which these baselines were not evaluated (since a simple rule-based baseline can achieve 100% performance in these cases). Empty values in all other cases correspond to scenarios where the LLM model failed to follow task-specific instructions, making the measured scores too unreliable to report. Notes: 1) The results on response generation are provided in Table 7.2, and 2) For the cases where the models guess the outputs for each issue (like books, balls, or hats in DND) separately, we simply report the average score across all issues. The results for generation tasks (i.e., dur_gen_resp_ca, dur_gen_resp_dnd) are presented in Table 7.2. 204 Prompt example (Task: sta_ask_point_values_ca) Task Description: You are negotiating with your campsite neighbor over an extra supply of food, water, and firewood for your camping trip. Different types of packages are worth different amounts of points to each one of you. You’ll be provided with information about the negotiation. Then, you’ll answer a question. Here are the number of food, water, and firewood packages available in the negotiation, contained in <count>tags. <count>Food Packages: 3 Water Packages: 3 Firewood Packages: 3 </count> Here are the number of points you get for each type of package, contained in <value>tags. <value>Each Food Package: 3 points Each Water Package: 5 points Each Firewood Package: 4 points </value> Question: How many points is one package of each issue worth to you? Present your answer as a json within <answer></answer>tags with keys as issues (food, water, and firewood) and values as the corresponding answers. Table E.5: Prompt example for the sta_ask_point_values_ca task. Prompt example (Task: dur_full_proposal_dnd) Task Description: You are negotiating with a partner over some quantity of books, hats, and balls to determine who gets which items. Different types of items are worth different amount of points to each one of you. You’ll be provided with information about the negotiation. Then, you’ll answer a question. Here are the number of books, hats, and balls available in the negotiation, contained in <count>tags. <count>Books: 3 Hats: 1 Balls: 2 </count> Here are the number of points you get for each type of item, contained in <value>tags. <value>Each Book: 1 points Each Hat: 5 points Each Ball: 1 points </value> Here is an utterance from the negotiation, contained in <utterance>tags. <utterance>YOU: i’ll take the hat and balls if you want the books </utterance> Question: How many items does the speaker get for each issue in the proposal delimited by the <utterance>tags? Present your answer as a json within <answer></answer>tags with keys as issues (books, hats, and balls) and values as the corresponding answers. If the answer is not clear for an issue, output NA. Table E.6: Prompt example for the dur_full_proposal_dnd task. lengthy due to detailed task descriptions. To enhance training efficiency during fine-tuning, we removed these lengthy descriptions from the inputs to the Flan-T5 model and instead replaced them with simple task instructions such as ‘predict highest priority’, similar to using brief descriptions when typically pretraining Flan-T5 in a multi-task setting. 
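A minimal sketch of this preprocessing step is shown below. It assumes that each LLM-style prompt begins with the verbose task description and that the structured context starts at the first <count> tag; the brief instruction strings and field names are illustrative (only "predict highest priority" is quoted from the text).

BRIEF_INSTRUCTIONS = {
    # hypothetical mapping from task name to a short instruction
    "dur_ask_high_priority_ca": "predict highest priority",
    "dur_partner_ask_low_priority_ca": "predict partner's lowest priority",
}

def to_flan_example(task_name, prompt, answer):
    """Turn one evaluation prompt into an (input, target) pair for Flan-T5 fine-tuning."""
    idx = prompt.find("<count>")
    context = prompt[idx:] if idx != -1 else prompt        # drop the lengthy task description
    instruction = BRIEF_INSTRUCTIONS.get(task_name, task_name.replace("_", " "))
    return {"input": f"{instruction}: {context}", "target": answer}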
205 Prompt example (Task: end_deal_specifics_ca) Task Description: You are negotiating with your campsite neighbor over extra supply of food, water, and firewood for your camping trip. Different types of packages are worth different amount of points to each one of you. You’ll be provided with information about the negotiation. Then, you’ll answer a question. Here are the number of food, water, and firewood packages available in the negotiation, contained in <count>tags. <count>Food Packages: 3 Water Packages: 3 Firewood Packages: 3 </count> Here are the number of points you get for each type of package, contained in <value>tags. <value>Each Food Package: 3 points Each Water Package: 5 points Each Firewood Package: 4 points </value> Here is the complete dialogue, contained in <dialogue>tags. <dialogue>THEM: Hello, I would like to have three packages of food. We’ve decided to stay an extra night but need more food to do so. YOU: I would be open to that if you could give me three packages of water , THEM: Hmmm...I’m pretty muddy due to clumsiness, so I may need one extra. I could give you two waters and all of the firewood. What do you think? , YOU: So are you suggesting that I would get 2 waters, 3 firewood, and no food? THEM: Right! Well, beyond the food you already have. YOU: I have an extra person camping with us that I didn’t expect when I bought food, so I could use one if you’re willing , THEM: I understand that! I wasn’t expecting to stay an extra night, but the weather is too perfect to leave. I can manage with two packages of food for sure. , YOU: Great! Thank you for being so understanding! THEM: No problem! So are we in agreement that I get 2 food, 1 water and you get the reverse? I could also probably use one firewood, but it’s not as important to me. YOU: I can give you one firewood, so I’ll be getting 1 food, 2 water, and 2 firewood? </dialogue> Question: In the final deal, how many item of each issue did you get? Present your answer as a json within <answer></answer>tags with keys as issues (food, water, and firewood) and values as the corresponding answers. If there was no agreement, answer NA for each issue. Table E.7: Prompt example for the end_deal_specifics_ca task. After applying the previously described preprocessing methods, we constructed the final dataset and divided it for training and evaluation purposes into train, dev, and test sets with a ratio of 8:1:1, respectively. Multi-task Training In line with how Flan-T5 was originally trained, we conducted our experiments in two ways: 1) Training a single model on all tasks together (FT-5-All-Task) and 2) Training one model for each individual task separately (FT-5-By-Task). The results showed that the FT-5-All-Task model generally underperformed compared to the FT-5-By-Task models, with observed poor learning in several tasks. Consequently, FT-5-By-Task models were chosen as baseline. 206 Prompt example (Task: end_deal_total_ca) Task Description: You are negotiating with your campsite neighbor over extra supply of food, water, and firewood for your camping trip. Different types of packages are worth different amount of points to each one of you. You’ll be provided with information about the negotiation. Then, you’ll answer a question. Here are the number of food, water, and firewood packages available in the negotiation, contained in <count> tags. <count> Food Packages: 3 Water Packages: 3 Firewood Packages: 3 </count> Here are the number of points you get for each type of package, contained in <value> tags. 
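For completeness, the per-task fine-tuning (FT-5-By-Task) can be sketched with the Hugging Face Seq2SeqTrainer as below, using the google/flan-t5-base checkpoint referenced in the Implementation paragraph that follows. The datasets are assumed to be Hugging Face Dataset objects with "input" and "target" fields, and every hyperparameter shown is a placeholder rather than a value actually used in this work.

from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

def finetune_one_task(train_ds, dev_ds, output_dir):
    """Fine-tune a separate Flan-T5 model on a single task (FT-5-By-Task)."""
    name = "google/flan-t5-base"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    def encode(batch):
        enc = tokenizer(batch["input"], truncation=True, max_length=512)
        enc["labels"] = tokenizer(text_target=batch["target"],
                                  truncation=True, max_length=64)["input_ids"]
        return enc

    train_ds = train_ds.map(encode, batched=True)
    dev_ds = dev_ds.map(encode, batched=True)

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir, learning_rate=3e-4, num_train_epochs=5,   # placeholder values
        per_device_train_batch_size=16, evaluation_strategy="epoch", save_strategy="epoch")
    trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds,
                             tokenizer=tokenizer,
                             data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
    trainer.train()
    return trainer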
<value> Each Food Package: 3 points Each Water Package: 5 points Each Firewood Package: 4 points </value> Here is the complete dialogue, contained in <dialogue> tags. <dialogue> THEM: Hello, I would like to have three packages of food. We’ve decided to stay an extra night but need more food to do so. YOU: I would be open to that if you could give me three packages of water THEM: Hmmm...I’m pretty muddy due to clumsiness, so I may need one extra. I could give you two waters and all of the firewood. What do you think? YOU: So are you suggesting that I would get 2 waters, 3 firewood, and no food? THEM: Right! Well, beyond the food you already have. YOU: I have an extra person camping with us that I didn’t expect when I bought food, so I could use one if you’re willing THEM: I understand that! I wasn’t expecting to stay an extra night, but the weather is too perfect to leave. I can manage with two packages of food for sure. YOU: Great! Thank you for being so understanding! THEM: No problem! So are we in agreement that I get 2 food, 1 water and you get the reverse? I could also probably use one firewood, but it’s not as important to me. YOU: I can give you one firewood, so I’ll be getting 1 food, 2 water, and 2 firewood? </dialogue> Question: How many points did you get at the end of the negotiation? NOTE: Let’s think step-by-step! Put your thoughts in <thinking> </thinking> tags, and put your answer as a single number in <answer> </answer> tags. Table E.8: CoT prompt example for the end_deal_total_ca task. Compute Resources: We trained the baseline model (Flan-T5) for each task under various experimental settings and hyperparameter adjustments, utilizing over 500 hours of GPUs such as NVIDIA V100 and A100 GPUs. Implementation: For fine-tuning Flan-T5, we used the released model1 on the Hugging Face model hub as a back-bone model. The evaluation code for model assessment was developed from 1https://huggingface.co/google/flan-t5-base 207 Task Question sta_total_item_count_dnd sta_total_item_count_ca What is the total number of items being negotiated over? Present your answer as a single number with no additional text. sta_max_points_dnd sta_max_points_ca What is the maximum number of points that you can possibly get in any deal? Present your answer as a single number with no additional text. sta_ask_point_values_dnd How many points is one item of each issue worth to you? Present your answer as a JSON within <answer></answer>tags with keys as issues (books, hats, and balls) and values as the corresponding answers. sta_ask_point_values_ca How many points is one package of each issue worth to you? Present your answer as a JSON within <answer></answer>tags with keys as issues (food, water, and firewood) and values as the corresponding answers. sta_ask_low_priority_ji dur_ask_low_priority_ji What is your lowest priority issue? Present your answer as one of the following multiple choice options. You must select an option. A: position / B: company / C: salary / D: days_off / E: workplace sta_ask_low_priority_ca dur_ask_low_priority_ca What is your lowest priority issue? Present your answer as one of the following multiple choice options. You must select an option. A: food / B: water / C: firewood sta_ask_high_priority_ji dur_ask_low_priority_ca What is your highest priority issue? Present your answer as one of the following multiple choice options. You must select an option. 
A: position / B: company / C: salary / D: days_off / E: workplace sta_ask_high_priority_ca dur_ask_high_priority_ca What is your highest priority issue? Present your answer as one of the following multiple choice options. You must select an option. A: food / B: water / C: firewood dur_strategy_ca Which negotiation strategies are employed in the utterance? Present your answer as a comma-separated list of strategies, contained in <answer></answer>tags with no additional text. dur_partner_ask_low_priority_ji What is the recruiter’s lowest priority issue? Present your answer as one of the following multiple choice options. You must select an option. A: position / B: company / C: salary / D: days_off / E: workplace dur_partner_ask_low_priority_ca What is your partner’s lowest priority issue? Present your answer as one of the following multiple choice options. You must select an option. A: food / B: water / C: firewood dur_partner_ask_high_priority_ji What is the recruiter’s highest priority issue? Present your answer as one of the following multiple choice options. You must select an option. A: position / B: company / C: salary / D: days_off / E: workplace dur_partner_ask_high_priority_ca What is your partner’s highest priority issue? Present your answer as one of the following multiple choice options. You must select an option. A: food / B: water / C: firewood dur_gen_resp_dnd dur_gen_resp_ca Given the recent dialogue history inside <dialogue>tags, generate your next response in the negotiation concisely, following a similar style as previous utterances. dur_full_proposal_dnd How many items does the speaker get for each issue in the proposal delimited by the <utterance>tags? Present your answer as a JSON within<answer></answer>tags with keys as issues (books, hats, and balls) and values as the corresponding answers. If the answer is not clear for an issue, pick your best guess. dur_full_proposal_cra How many items does the speaker get for each issue in the proposal delimited by the <utterance>tags? Present your answer as a JSON within<answer></answer>tags with keys as issues (painting, lamp, and record) and values as the corresponding answers. If the answer is not clear for an issue, output NA. dur_dial_act_ji dur_dial_act_cra Which dialogue acts are employed in the utterance delimited by the<utterance>tags? Present your answer as a Python list of the relevant options. At least one option applies. dur_dial_act_dnd Which dialogue act is employed in the utterance contained in<utterance>tags? Present your answer as a single word. end_partner_deal_satisfaction_ca How satisfied do you think your partner is with the negotiation outcome? Present your answer as one of the following multiple choice options. You must select an option. A: extremely_dissatisfied / B: slightly_dissatisfied / C: undecided / D: slightly_satisfied / E: extremely_satisfied end_partner_deal_likeness_ca How much do you think your partner likes you? Present your answer as one of the following multiple choice options. You must select an option. A: extremely_dissatisfied / B: slightly_dissatisfied / C: undecided / D: slightly_satisfied / E: extremely_satisfied end_deal_total_dnd end_deal_total_ca How many points did you get at the end of the negotiation? Present your answer as a single number with no additional text. end_deal_specifics_ji In the final deal, what value was agreed on for each issue? 
Present your answer as a JSON within <answer></answer>tags with keys as issues (Company, Position, Workplace, Salary, Days_off) and values as the corresponding answers. If you are unsure, pick your best guess. end_deal_specifics_dnd In the final deal, how many items of each issue did you get? Present your answer as a JSON within <answer></answer>tags with keys as issues (books, hats, and balls) and values as the corresponding answers. If you are unsure, pick your best guess. end_deal_specifics_ca In the final deal, how many items of each issue did you get? Present your answer as a json within <answer></answer>tags with keys as issues (food, water, and firewood) and values as the corresponding answers. If you are unsure, pick your best guess. end_deal_satisfaction_ca How satisfied are you with the negotiation outcome? Present your answer as one of the following multiple choice options. You must select an option. A: extremely_dissatisfied / B: slightly_dissatisfied / C: undecided / D: slightly_satisfied / E: extremely_satisfied end_deal_likeness_ca How much do you like your partner? Present your answer as one of the following multiple choice options.You must select an option. A: extremely_dissatisfied / B: slightly_dissatisfied / C: undecided / D: slightly_satisfied / E: extremely_satisfied Table E.9: Questions used in the prompts for all tasks. scratch, while for evaluating generated responses, we utilized existing packages for BLEU2 and ROUGE-L3 respectively. 2https://pypi.org/project/evaluate/ 3https://pypi.org/project/rouge-score/ 208 E.4 Task-Specific Analysis Figure E.1 presents the results for few-shot prompting and incorporating previous utterances for various Annotation tasks. Figure E.2 shows the results for Chain-of-Thought (CoT) prompting technique on four multi-hop arithmetic reasoning Comprehension tasks. Finally, we present the effect of varying the number of utterances seen by the models for preference prediction tasks in Figure E.3. (a) Evaluation on dur_dial_act_dnd task (b) Evaluation on dur_dial_act_ji task (c) Evaluation on dur_dial_act_cra task (d) Evaluation on dur_strategy_ca task Figure E.1: Evaluation on four tasks demonstrating the impact of 1) Two-shot prompting and 2) Incorporating two prior utterances. We only consider GPT-4 for this analysis. F1 refers to the macro F1 score over all labels. 209 (a) Evaluation on sta_max_point_ca task (b) Evaluation on sta_max_point_dnd task (c) Evaluation on end_deal_total_ca task (d) Evaluation on end_deal_total_dnd task Figure E.2: Evaluation on four tasks demonstrating the impact of CoT prompting. We only consider GPT-4 for this analysis. E.5 Generation Outputs We present sample generations by the compared models along with the ground-truth reference in Tables E.10 and E.11. 210 (a) Evaluation on dur_ask_high_priority_ca task (b) Evaluation on dur_ask_low_priority_ca task (c) Evaluation on dur_partner_ask_high_priority_ca task (d) Evaluation on dur_partner_ask_low_priority_ca task Figure E.3: Evaluation on four tasks demonstrating the impact of increasing the number of utterances seen by the models. We only consider GPT-3.5 and GPT-4 for this analysis. X-Axis denotes the number of utterances from the start of the dialogue that are used in the prompts. 211 Counts Books: 1, Hats: 3, Balls: 2 Agent Values Each Book: 1 points, Each Hat: 1 points, Each Ball: 3 points Dialogue THEM: i’m interested in the book and both balls in exchange for the hats . YOU: definitely not both balls . we could split them ? 
THEM: i think that’s fair . i agree . 1 book , 1 ball for me . . . rest for you . Response Coherency Strategy Reference YOU: deal ! 5 5 Flan-T5 YOU: deal. 5 5 Mistral7b YOU: I understand your offer, but I’m not willing to give up both balls. How about we split the balls and the hats evenly? That way, we can both get what we want. 2 4 Wizard13b YOU: 1 book, 1 ball for me. 1 2 Vicuna13b YOU: 1 book, 1 ball for you, and the rest for me. 2 5 Vicuna33b YOU: Sounds good. I’ll take the remaining book and both hats, and you can have the ball. 2 3 GPT-3.5 YOU: Sounds good. I’ll take the 3 hats then. Deal. 5 5 GPT-4 YOU: Great, so to confirm, I’ll take 2 hats and 1 ball, and you get 1 book and 1 ball. Do we have a deal? 3 4 Table E.10: Sample responses generated by the models for an instance from the DND dataset. The Coherency and Strategy scores are based on human annotations on a scale of 1-5. Counts Food Packages: 3, Water Packages: 3, Firewood Packages: 3 Agent Values Each Food Package: 3 points, Each Water Package: 4 points, Each Firewood Package: 5 points Dialogue YOU: hello THEM: Hello, there. Should we discuss how to divide up the items for the camping trip? YOU: yeah. i need 2 firewood, 1 water and 1 food THEM: I need to have 2 or 3 pieces of firewood because I did not pack any way to heat up my food without firewood. Response Coherency Strategy Reference YOU: I also need 2 pieces of firewood 4 3 Flan-T5 YOU: i need 2 firewood, 2 water and 1 food 3 3 Mistral7b YOU: Based on the recent dialogue history, it seems like we both need firewood for our camping trip. How about we agree to split the 3 firewood packages equally, and I’ll take the 1 water and 1 food package? 5 4 Wizard13b YOU: I understand, I can provide you with 2 firewood, 1 water, and 1 food. How about we split the remaining items equally? 5 3 Vicuna13b YOU: I understand your need for firewood. How about we split the firewood packages evenly, with each of us getting 1 and a half packages? That way, you’ll have enough firewood for your needs, and I’ll still have some for my camping trip. 5 3 Vicuna33b YOU: I understand your need for firewood. Let’s propose a deal: I’ll take 1 firewood, 1 water, and 2 food packages. In return, you can have the remaining 2 firewood packages. This way, we both get what we need. 5 2 GPT-3.5 YOU: I can give you 2 firewood, 1 water, and 1 food package. 4 5 GPT-4 YOU: How about if I take 2 firewood and you take 1, and in exchange, I can let you have 2 water packages? 4 4 Table E.11: Sample responses generated by the models for an instance from the CaSiNo dataset. The Coherency and Strategy scores are based on human annotations on a scale of 1-5. 212 Appendix F Mixed-Motive Tasks Across Diverse Domains: A Literature Review F.1 Literature Compilation In this section, we provide details about how the literature was curated for our survey. We hope this helps the overall reproducibility and also guides similar studies in the future. We followed a simple two-stage process. First, we compiled the relevant datasets across diverse domains (presented in Section 8.2), and then, we compiled the techniques developed on these datasets (presented in Section 8.3). Step I - Datasets: We primarily focused on gathering datasets that involve logs for dialogue interactions but also included the datasets based on transcripts from multimodal interactions as well. Given the large breadth of research in this space across a number of domains, our collection is not exhaustive but is rather restricted to the following sources. 
We surveyed the past 6 years of *ACL conference proceedings. We then covered several online repositories of dialogue data to capture datasets published at other venues. This includes ParlAI1 , 1https://github.com/facebookresearch/ParlAI 213 Huggingface2 , NLP-Progress3 , and Convokit4 . Further, we revisited several recent surveys in dialogue systems and Natural Language Generation (NLG) research (Huang et al., 2020; Zhang et al., 2020c; Duerr and Gloor, 2021; Ni et al., 2023). Datasets that were categorized as task-oriented or open-domain in these surveys but also contain some aspects of social influence have been included in our discussion. As discussed in Section 8.3, we also include the datasets that have not been directly used for designing dialogue systems but rather for various Natural Language Understanding (NLU) subtasks that can be crucial for the eventual development of dialogue systems in this space. Finally, we also reviewed the citation graphs of the collected papers from Google Scholar. Overall, we ended up with 22 dataset papers, spanning 12 publication venues, 4 languages, and 7 application domains. Step II - Methods: Compiling the methodological progress was based on the models developed on the curated datasets. For this purpose, we simply reviewed the citations of all the dataset papers using Google Scholar. F.2 Existing Datasets A comprehensive list of the datasets discussed in Chapter 8 is provided in Table F.1. For each dataset, we mention the application domain, source, and key statistics, as well as the available metadata and annotations apart from the conversation logs. 2https://huggingface.co/docs/datasets/index 3http://nlpprogress.com/english/dialogue.html 4https://convokit.cornell.edu/documentation/datasets.html 214 F.3 Five-Stage Theoretical Model We develop a five-stage framework to summarize our recommendations for future work. These stages cover key decisions in the design of a dialogue system in this space, encouraging a holistic understanding of the system characteristics, target audience, underlying modeling techniques, and evaluation mechanisms. These steps are inspired by a behavior change model in healthcare research (Zhang et al., 2020a). We adapt this model to make it suitable for our case. We present these steps in Figure F.1. 215 Figure F.1: A theoretical framework for designing dialogue models for mixed-motive tasks. Curved arrows represent forward relations and the straight arrow represents the feedback. I. Task Specifications: Key properties that define the task in consideration and are captured by the collected dataset, II. Chatbot Characteristics and User Backgrounds: Attributes for the agent design and target audience, III. Chatbot Capacity: The desirable capabilities of the system, IV. Chatbot Design & Techniques: The modeling techniques to develop the dialogue system, and V. Evaluation Mechanisms: Metrics to evaluate system performance. 
216 Name (Citation) Domain Source Key Statistics Metadata & Annotations STAC (Asher et al., 2016) Games Crowdsource Dialogues: 1081 Turns/Dialogue: 8.5 Tokens/Turn: 4.2 Dialogue Acts; Discourse Structures Diplomacy (Peskov et al., 2020) Games Crowdsource Games: 12 Messages/Game: 1440.8 Words/Message: 20.79+ Intended and perceived truthfulness; Participant demographics and self-assessment of lying abilities; Ground-truth betrayals DinG (Boritchev and Amblard, 2021) Games University game night logs Games: 10 Turns/Game: 2357.5 Annotated question-answer pairs Tabletop (DeVault et al., 2015) MIBT Face-to-face, Wizard-of-Oz Face-to-face Dialogues: 89 Wizard-of-Oz Dialogues: 30 Participant demographics; Subjective questionnaire-based assessment DealOrNoDeal (Lewis et al., 2017) MIBT Crowdsource Dialogues: 5808 Turns/Dialogue: 6.6 Tokens/Turn: 7.6 — CaSiNo (Chawla et al., 2021b) MIBT Crowdsource Dialogues: 1030 Utterances/Dialogue: 11.6 Tokens/Utterance: 22 Participant demographics and personality traits; Outcome satisfaction; Partner perception; Strategy Annotations JobInterview (Yamaguchi et al., 2021) MIBT Crowdsource Dialogues: 2639 Turns/Dialogue: 12.7 Words/Turn: 6.1 Dialogue acts PersuasionforGood (Wang et al., 2019) Social Good Crowdsource Dialogues: 1017 Turns/Dialogue: 10.4 Words/Turn: 19.4 Participant sociodemographic, personality, and engagement in the conversation; Strategy annotations; Dialogue Acts CraigslistBargain (He et al., 2018) E-commerce Crowdsource Dialogues: 6682 Turns/Dialogue: 9.2 Tokens/Turn: 15.5 Dialogue Acts AntiScam (Li et al., 2020) E-commerce Crowdsource Dialogues: 220 Turns/Dialogue: 12.4 Words/Turn: 11.1 Dialogue Acts; Semantic Slots Motivational Interviewing (Tanana et al., 2016) Therapy & Support Psychotherapy session logs Sessions: 341 Utterances/Session: 513.2 Words/Utterance: 9.7 Behavior annotations SMS Counseling (Althoff et al., 2016) Therapy & Support SMS chat logs Dialogues: 80,885 Messages/Dialogue: 42.6∗ Words/message: 19.2∗ Post-conversation assessment for both the counselor (e.g., suicide risk, main issue etc.) and user (how they feel afterwards) EmpatheticDialogues (Rashkin et al., 2019) Therapy & Support Crowdsource Dialogues: 24,850 Utterances/Dialogue: 4.3 Words/Utterance: 15.2 — Hotline Counseling (Demasi et al., 2019) Therapy & Support Synthetic Transcripts Dialogues: 254 Messages/Dialogue: 40-60 Paraphrases by trained counselors mPED (Liang et al., 2021) Therapy & Support Physical activity clinical trials Sessions: 107 Turns/Session: 423.5 Minutes/Session: 28.8 Demographics; Physical activity related pre and post surveys; Strategy annotations Congressional Debates (Thomas et al., 2006) Argumentation Congressional transcripts Debates: 53 Segments/Debate: 72.8 Ground-truth label with each speech segment for support/oppose of the proposed bill Supreme Court (D.-N.-M. 
et al., 2012) Argumentation Oyez.org transcripts Cases: 7700 Utterances/Case: 220.8 Case-related metadata such as key dates, citation, parties involved, and voting results DeCour (Fornaciari and Poesio, 2012) Argumentation Italian court hearings Hearings: 35 Utterances/Hearing: 173.4 Tokens/Utterance: 13.9# Metadata for testimonies like place, date, demographics; Hearing related details; Truthfulness annotations ChangeMyView (Tan et al., 2016) Argumentation Reddit Discussion Trees: 20,626 Nodes/Tree: 61.1 The original post with initial arguments and explicitly recognized successful persuasive arguments from the opposing side DDO Debates (Durmus and Cardie, 2019) Argumentation debate.org logs Debates: 78,376 Messages/Debate: 7.7 User demographics; Debate metadata like dates and category; Audience votes and comments Court Debates (Ji et al., 2020) Argumentation China Court transcripts Dialogues: 260,190 Utterances/Dialogue: 13.9 — Target-Guided (Tang et al., 2019) Miscellaneous Crowdsource Dialogues: 9939 Utterances/Dialogue: 11.4 — Table F.1: Dialogue datasets grouped by task domains and sorted by publishing year within a domain. Statistics of the form X/Y denote average numbers. MIBT: Multi-Issue Bargaining Task. ∗Only computed for dialogues with additional survey responses, +Only computed for training data. #Only for Speaker utterances in front of the judge (does not include other members of the court). Not all datasets listed above have been directly used for designing end-to-end dialogue systems, but instead, these have enabled research into various useful sub-tasks and analyses. Please refer to Section 8.2 for a detailed discussion about these datasets and to Section 8.3 for information about various methods developed using them. 217
Abstract (if available)
Abstract
Social interactions often involve a mixture of motives. People seek to maximize their own interests without undermining the needs of others. Success in these interactions, referred to as mixed-motive interactions, demands a balance between self-serving and other-serving motives. For instance, in a typical negotiation, a player must balance maximizing their own goals with the goals of their partner so as to come to an agreement. If the player asks for too much, this can push the partner to walk away without an agreement, hence hurting the outcomes for all the parties involved. Such interactions are highly prevalent in everyday life, from deciding who performs household chores to customer support and high-stakes business deals. Consequently, automated systems capable of comprehending and participating in these strategic environments with human players find broad downstream applications. This includes advancing conversational assistants and the development of tools that make everyday social interactions more effective and efficient (e.g., by acting as a content moderator or a coach). Additionally, these systems hold a huge potential to transform pedagogical practices by dramatically reducing costs and scaling up social skills training.
Most efforts for automation focus on agent-agent interactions, where thousands of offers are exchanged between the players. These interactions are fundamentally different from human-agent conversations, which are much shorter and naturally involve human subjectivity, which in fact, has been a subject matter of research for decades across several disciplines, including Psychology, Affective Computing, and Economics. Hence, in order to simplify the design, most efforts in human-agent negotiations involve restrictive menu-driven communication interfaces that are based on button clicks and structured APIs for interaction between the human and the machine. This concreteness reduces the design complexity, but it comes at a cost - such interfaces hinder the study and incorporation of several aspects of real-world negotiations, such as complex strategies and emotion expression. Going beyond such constrained designs, it is desirable to incorporate more realistic modes of communication, such as natural language, for their utility in better serving downstream applications - our work aims to fill this gap.
In this dissertation, we present our foundational work for enabling mixed-motive human-machine dialogue, with a focus on bilateral chat-based negotiation interactions. We discuss our progress in three key areas: 1) The design of a novel task and dataset of grounded human-human negotiations that fueled our investigations into the role of emotion expression and linguistic strategies, 2) Techniques for dialogue systems capable of engaging in mixed-motive interactions by learning to strike a balance between self and partner interests, and 3) Defining a research space encompassing such strategic dialogue interactions to promote a research community for dedicated efforts and discussion in this area.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Modeling dyadic synchrony with heterogeneous data: validation in infant-mother and infant-robot interactions
PDF
A framework for research in human-agent negotiation
PDF
Decoding information about human-agent negotiations from brain patterns
PDF
Generating gestures from speech for virtual humans using machine learning approaches
PDF
Automated negotiation with humans
PDF
An intelligent tutoring system’s approach for negotiation training
PDF
Computational modeling of mental health therapy sessions
PDF
Parasocial consensus sampling: modeling human nonverbal behaviors from multiple perspectives
PDF
An investigation of fully interactive multi-role dialogue agents
PDF
Socially assistive and service robotics for older adults: methodologies for motivating exercise and following spatial language instructions in discourse
PDF
A planner-independent approach to human-interactive planning
PDF
Managing multi-party social dynamics for socially assistive robotics
PDF
Understanding and generating multimodal feedback in human-machine story-telling
PDF
Common ground reasoning for communicative agents
PDF
Nonverbal communication for non-humanoid robots
PDF
Enabling human-building communication to promote pro-environmental behavior in office buildings
PDF
Towards social virtual listeners: computational models of human nonverbal behaviors
PDF
Incrementality for visual reference resolution in spoken dialogue systems
PDF
Building and validating computational models of emotional expressivity in a natural social task
PDF
Towards generalizable expression and emotion recognition
Asset Metadata
Creator
Chawla, Kushal
(author)
Core Title
Computational foundations for mixed-motive human-machine dialogue
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-05
Publication Date
03/28/2024
Defense Date
03/21/2024
Publisher
Los Angeles, California
(original),
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
affective computing,AI,deep learning,Dialogue,human-computer interaction,machine learning,mixed-motive,natural language processing,negotiation,NLP,OAI-PMH Harvest,Psychology,theory of mind
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Lucas, Gale (
committee chair
), Gratch, Jonathan (
committee member
), Kim, Peter (
committee member
), Mataric, Maja (
committee member
), May, Jonathan (
committee member
)
Creator Email
kchawla@usc.edu,kushalchawla18@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113859105
Unique identifier
UC113859105
Identifier
etd-ChawlaKush-12728.pdf (filename)
Legacy Identifier
etd-ChawlaKush-12728
Document Type
Dissertation
Format
theses (aat)
Rights
Chawla, Kushal
Internet Media Type
application/pdf
Type
texts
Source
20240328-usctheses-batch-1132
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
affective computing
AI
deep learning
human-computer interaction
machine learning
mixed-motive
natural language processing
negotiation
NLP
theory of mind