COMMON GROUND REASONING FOR COMMUNICATIVE AGENTS

by Pei Zhou

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE), August 2024. Copyright 2024 Pei Zhou

Epigraph

To imagine a language means to imagine a form of life. LUDWIG WITTGENSTEIN

Dedication

To my father, mother, and Hào.

Acknowledgements

I want to start with my family. To my mom Min Wang and dad Dajun Zhou, whom I haven't seen in person for almost the entire 5 years of my PhD: it is our memory together in our home Chengdu, your unconditional support for my growth and happiness, and knowing that you are always only one video call away that helped me strive through the challenging journey. I hope that I have made you proud, and I pray for our reunion soon. To my uncles Dazhi Zhou and Dayong Zhou, aunts Yue Zhou and Xiaoyan Zhou, and cousins Yilin Zhou, Emma Zhou, and Samuel Zhou, whom I have visited in Vancouver: your love and care showered me during these years and provided me a home away from home. To my grandmas Runhua Liu and Shujun Chen: you constantly inspire me with your optimism and courage, and seeing your proud faces during our calls always makes me feel extremely lucky. To my grandpas Wenchang Zhou and Tinggui Wang, who passed away during the last 5 years: I'm so, so sorry that I couldn't make it for the last moments. Please know that when I got hooded with the doctorate degree, I was thinking about your smiles in heaven. I love you all. I hope I made all of you proud, and I will continue.

To my mentors, whom I'm incredibly blessed to have had along the way. First, my PhD advisors Xiang Ren and Jay Pujara: thank you for all the guidance, trust, and freedom provided on my academic journey. My growth would in no way have been the same without your time and effort. To my committee members Toby Mintz and Jieyu Zhao: working with you has been extremely rewarding, and I enjoyed every conversation. This thesis would not have been here without you. To Muhao Chen, Kai-Wei Chang, and Yizhou Sun, who helped me get to know research and NLP during my undergraduate years: I'm lucky to have started my academic journey learning from you. And to my internship mentors. To Yang Liu, who took in a first-year PhD student who was still learning the research world, provided immense help during two internships at Amazon, and continued to guide me even after that. To Prithviraj Ammanabrolu. Raj, no words can describe how lucky I am to have you as my mentor at AI2. You showed me that an incredibly smart researcher can also be incredibly cool and kind. You will continue to be my role model in the years to come. To Shyam Upadhyay, Swaroop Mishra, and Steven Zheng: my time at Google has shown me what real frontier research hubs are like. When I was lost in direction and faith, you trusted in me and tapped my shoulder to stand up. I'm really proud of our work together. To Longqi Yang, my future mentor at Microsoft, who has already been an amazingly supportive manager and life friend: I'm incredibly blessed to have met you and to have you as my mentor in my next adventure. I can't wait to share this journey together with you.

To my friends, who have made my time at USC so much more beautiful and fun.
To my labmates and my dear friends Bill Yuchen Lin, Aaron Chan, Woojeong Jin, Qinyuan Ye, Jun Yan, Xisen Jin, Shushan Arakelyan, Soumya Sanyal, Brihi Joshi, Jacob Bremerman, Sahana Ramnath, Huihan Li, Hirona Arai, Matthew Finlayson, Pegah Jandaghi, Kexuan Sun, Avijit Thawani, Dong-Ho Lee, Kian Ahrabian, Yifan Jiang, and Eric Boxer: our socials, jokes, and selfless help to each other kept me sane during my PhD. I would do it again just for our time together. To my long-time collaborators and friends Justin Cho, Ninareh Mehrabi, Rahul Khana, Aman Madaan, and Hyunwoo Kim: I've learned so much from you all and am looking forward to our future reunions! To my academic PhD buddies Zihao He, Yufeng Yin, Qinyang Shou, Yue Yu, Jiao Sun, Nan Xu, Tenghao Huang, Fei Wang, Hao Zhu, Xuhui Zhou, Valentina Pyatkin, Mozhdeh Gheini, Weijia Shi, Yihe Deng, Yu Yang, Ameya Godbole, Ziyi Liu, and many more: the real AGI is the friends we made along the way, and I sincerely hope that our paths cross again. To my dear life friends Jumo Yang, Wen Chen, Zilong Zhang, Yiwen Cao, Changhao Li, George Weng, Rika Liu, Angela Song, Yuanyue Gu, Jojo, Tingshuo Yi, Zheqin Li, Yinqi Huang, Zihang Pan, Hao Wang, Jue Wang, Ziheng Fang, and the list goes on: I cherished every second of our time together – boardgaming, drinking, singing, hiking, stargazing, going to concerts, laughing – and you all have made my life so much more joyful.

To my starlight, my jasmine, and my summer-night breeze, Hào Huynh. I love you to the deepest part of my heart, and that cannot repay how much you have supported, loved, and sacrificed yourself for me on this journey. Thank you for the twice-a-week long rides to my place for such a long time, thank you for immediately showing up at my darkest time, thank you for reigniting my faith in myself, thank you for all our happiness together, and thank you for smiling like the warmest star on an August night. You are part of my soul in past, present, and future. You are the grandeur of life, and I am so blessed to continue walking with you into tomorrow.

Finally, to Pei Zhou. You made it and I'm so proud of you. Thank you for running through all the stress and anxiety, thank you for always being curious about yourself and the world, thank you for protecting the little boy who craves love and happiness, thank you for seeking help at the hardest time, thank you for being confident and optimistic about tomorrow, thank you for working hard towards self-realization. Give yourself the biggest hug and let's adventure on...

Table of Contents Epigraph . . . ii Dedication . . . iii Acknowledgements . . . iv List of Tables . . . xi List of Figures . . . xiii Abstract . . . xvii Chapter 1: Introduction . . . 1 1.1 Why Do We Need Communicative AI Agents? . . . 1 1.2 How Do We Communicate and What Is Common Ground? . . .
2 1.3 Thesis Statement and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 Chapter 2: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Conversational AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Open Domain Dialogue Response Generation . . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Dialog Response Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.3 Goal-Driven Grounded Dialogue Agents . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Commonsense Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Why Commonsense Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Evaluating Commonsense Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Theory-of-Mind and Pragmatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 Theory-of-Mind in Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.2 Language Models and Theory-of-Mind . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Large Language Models and Prompting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.1 Large Language Models and Agents . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4.2 Prompting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4.3 Reasoning and Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Chapter 3: Improving Language Model Conversational Capabilities by Leveraging Common Sense Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.1 Response Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 vii 3.2.2 Implicit Knowledge Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 Learning to Generate Implicit Knowledge by Self-Talk . . . . . . . . . . . . . . . . . . . . . 17 3.3.1 Knowledge-Aligned Dialogues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3.2 Knowledge Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.3 Knowledge-Dialogue Transition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.4 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4.2 Compared Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.4.3 Evaluation Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.5.1 Performance of Response Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.5.2 Quality of Generated Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.5.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 32 3.7 Ethics and Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Chapter 4: Inference-Based Diverse Common Ground Further Improves Dialogue Response Quality 34 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Inference-Based Common Ground . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.1 Grounding in Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.2 Formulating CG in Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3 Collecting Reflect Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.3.1 Pre-Collection: Selecting Base Dialogue Turns for Expansion . . . . . . . . . . . . 40 4.3.2 Stage 1. Inference Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3.3 Stage 2. Response Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3.4 Data Collection Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.3.5 Quality Control and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4 Limitations of Reflex-Prone Dialogue Data and Models . . . . . . . . . . . . . . . . . . . . 44 4.4.1 Human Evaluation Dimensions-SSI . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4.2 Comparing Original vs Reflect Responses . . . . . . . . . . . . . . . . . . . . . . . 45 4.4.3 Comparing Reflex RG vs Reflect Data . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.5 A Little CG Goes a Long Way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.5.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.7 Ethics and Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Chapter 5: Modeling Theory-of-Mind For Goal-Driven and Grounded Communication . . . . . . . 61 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2 G4C and G-Dragon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.2.1 G4C Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.2.2 D&D Dialogue Generation as a Partially Observable Markov Decision Process . . . 64 5.2.3 Play-By-Post D&D Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.2.4 Creating the Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.3 Theory-of-Mind Inspired Guidance Generation in Grounded Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 viii 5.3.1 Modeling Intents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.3.2 Modeling (Limited) Theory-of-Mind (ToM) Using RL for Guidance Generation . . . 72 5.4 Evaluating G4C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4.1 Measuring Fluency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4.2 Measuring Groundedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 74 5.4.3 Measuring Fulfillment of Intents . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.5.1 Compared Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.5.2 Correlation Analysis of Automatic Metrics . . . . . . . . . . . . . . . . . . . . . . . 76 5.5.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.7 Ethics and Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.8.1 Why only training a DM model to generate guidance instead of everything a DM says? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.8.2 How generalizable is a DM model on other domains? . . . . . . . . . . . . . . . . . 82 Chapter 6: Eliciting Planning in Large Language Models by Self-Discovering Reasoning Structures 86 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.2 Self-Discovering Reasoning Structures for Problem Solving . . . . . . . . . . . . . . . . . . 89 6.2.1 Stage 1: Self-Discover Task-Specific Structures . . . . . . . . . . . . . . . . . . . . 90 6.2.2 Stage 2: Tackle Tasks Using Discovered Structures . . . . . . . . . . . . . . . . . . 91 6.2.3 Prompt Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.3 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.3.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.3.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.3.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.4.1 Does Self-Discover Improve LLM Reasoning? . . . . . . . . . . . . . . . . . . . . 96 6.4.2 Which Types of Problems Do Self-Discover Help the Most? . . . . . . . . . . . . 98 6.4.3 How Efficient is Self-Discover? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.4.4 Qualitative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.4.5 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.5 Deep Diving Into Self-Discovered Reasoning Structures . . . . . . . . . . . . . . . . . . . . 103 6.5.1 Importance of Self-Discover Actions . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.5.2 Towards Universality of Discovered Reasoning Structures . . . . . . . . . . . . . . 104 6.5.3 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.5.4 Further Anaysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Chapter 7: Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.1 Impact and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.2 Limitations . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.3.1 Self-Learning with Evolving Common Ground . . . . . . . . . . . . . . . . . . . . . 112 7.3.2 Interactive Super-Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.3.3 Safety to Deploy in the Wild from Robust Reasoning . . . . . . . . . . . . . . . . . 113 ix Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 x List of Tables 3.1 Knowledge representation mappings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 Dialogue data statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3 Human evaluation on response quality when comparing different model variants. We show the percentage of times annotators prefer each variant to TBS-hard-symbol-NL and ties, i.e. wins/ties%. Bold-faced numbers indicate statistical significance (p < 0.05) improvement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.4 Automatic evaluations using multiple metrics on response quality. All models are based on DialoGPT-medium. Bold-faced are the best performance. One “*” indicates statistical significant (p < 0.05 in Wilcoxon signed-rank test) improvement upon the best-performing non-GT baseline and “**” indicates significant improvement upon the GT baselines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5 Human evaluation on knowledge quality. For models that generate novel (not in ConceptNet) knowledge, we show non-novel/novel percentages. “*” means knowledge is from ConceptNet (not generated). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.6 Randomly sampled qualitative examples from our TBS model (TBS-hard-symbol-QA). . . 30 3.7 Randomly sampled qualitative examples from TBS-hard-symbol-QA. We can find that in most cases the generated knowledge makes sense and responses are grounded in the knowledge and not generic. However, we do find some error cases such as the last example the response does not use the knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.1 Inference dimensions and corresponding questions . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Human evaluation on inference (CG) quality. We compare inferences from three resources and compare their sensibility and relevance to dialogue context. . . . . . . . . . 43 4.3 Evaluation dimensions for RG with examples (dialogue context from Figure 5.1). . . . . . 45 xi 4.4 Randomly sampled dialogue example to qualitatively compare responses from different resources and models. Compared to original and GPT3 top-5 responses which mostly express “You are a great nurse/doctor”, Reflect responses based on the inference QA pairs are much more specific and engaging. We find that GPT3-InfQ, though only guided by an inference question, generates less genric responses such as “... I hope you get a raise soon.” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.1 Example dialogue transcript from D&D game play. . . . . . . . . . . . . . . . . . . . . . . 64 5.2 Example dialogue transcript from D&D game play with explanations. . . . . . . . . . . . . . . . 67 5.3 Model variants. All targeted outputs are guidance from DM. 
All training data size is 41k except for human-label (2k). The test set (1k) is shared across all. . . . . . . . . . . . . . . 74 5.4 Results on the 3 dimensions using metrics from Section 5.4 comparing models that use IDM-generated pseudo-labels and human-generated labels. . . . . . . . . . . . . . . . . . 75 6.1 All 39 reasoning modules consisting of high-level cognitive heuristics for problem-solving. We adopt them from [39]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.2 Self-Discover significantly improves LLM reasoning across a diverse set of 25 complex tasks: BBH, T4D and MATH. CoT: zero-shot Chain of Thought [67]. PS: plan-and-solve prompting [150]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.3 Big Bench-Hard [136] per-task performance of GPT-4 and PaLM 2-L with Self-Discover. 98 6.4 Additional baselines including Tree-of-Thought (ToT) and Graph-of-Thought (GoT). . . . . 102 6.5 MMLU [136] per-task performance of GPT-4 and PaLM 2-L with Self-Discover. We sampled 10 tasks with 50 examples each. SD (instance) refers to that we run stage one on each question and use the generated structure during solving, to acount for the diversity of questions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.6 Examples of wrong reasoning structures for MATH. The first error in the reasoning structure is highlighted in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.7 Examples of wrong calculations for MATH. The first error in the intermediate computations is highlighted in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 xii List of Figures 3.1 A motivating example for our study. We look to train models to externalize the implicit knowledge grounding step by explicitly generating knowledge before responding. 14 3.2 Method illustration. We first propose matching approaches to construct knowledgealigned dialogues. Then we consider different alternatives to represent implicit knowledge. Finally, we connect knowledge and dialogue and ask models to generate both knowledge and responses given history. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 Human evaluation interface for response quality on dimensions: grammar, coherence, and engagingness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.4 Human evaluation interface for response quality on dimensions: informativeness, specificity, and common sense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.5 Human evaluation interface for knowledge quality with 3 questions: does the knowledge make sense as a standalone fact, is the knowledge relevant to the context, and does the generated resposne use the knowledge? . . . . . . . . . . . . . . . . . . . . . . . 25 3.6 Human evaluation results for pairwise comparison between TBS and a baseline. We show preference percentages for each model. “*” indicates statistical significance difference. For TBS we show averaged preferences. . . . . . . . . . . . . . . . . . . . . . . 26 3.7 Human evaluation comparing TBS with models that have access to ground-truth responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.8 Effects of noisy knowledge on response quality. . . . . . . . . . . . . . . . . . . . . . . . 31 4.1 A motivating example. 
We aim to help RG models produce more human-like responses instead of generic ones. We argue that integrating common ground by making inferences is crucial. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 Illustration of different RG approaches. Common RG does not model CG explicitly, TBS [183] and CICERO [49] post-hoc augments dialogues with CG, and we aim to follow natural human communication and first collect CG and then responses based on CG. We also factor in the diversity of plausible responses given a dialogue context that result from distinct CG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 xiii 4.3 Reflect collection procedure illustration. We first collect CG materialized as inferences expressed in QA along different dimensions. Then for each QA pair, we collect multiple responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.4 Crowdsourcing principles for two-stage collection. . . . . . . . . . . . . . . . . . . . . . . 42 4.5 GPT3-Few Shot Prompting Format (no inference). . . . . . . . . . . . . . . . . . . . . 46 4.6 GPT3-Few Shot-Inference Question Prompting Format. . . . . . . . . . . . . . . . . 46 4.7 Response evaluation separated by inference dimensions. We find that GPT3-FS-InfQ generate better responses than humans on the potential consequences dimension while generates worse on attributes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.8 Human evaluation comparing human dialogues: original data and our collected Reflect. 48 4.9 Averaged performance comparing Reflect with reflexive models. Predictably, RG models trained to directly output responses from dialogue context using no-common ground responses have a large gap in response quality compared to human responses guided by inferences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.10 Averaged performance comparing before and after reflexive models leveraged inference-guided Reflect data. We find that inference-based common ground prompting helps GPT3-175B significantly, even comparable to human responses from Figure 4.8. Blender also gained much improvement from pre-trained after fine-tuning on Reflect, however we find no significant effects on appending inference questions. . . . . . . . . . 51 4.11 Response evaluation separated by inference dimensions. We find that GPT3-FS-InfQ generate better responses than humans on the potential consequences dimension while generates worse on attributes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.12 Inference collection UI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.13 Inference collection examples for turkers. . . . . . . . . . . . . . . . . . . . . . . . . 56 4.14 Response collection UI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.15 Response collection examples for turkers. . . . . . . . . . . . . . . . . . . . . . . . . 58 4.16 SSI evaluation UI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.17 SSI evaluation instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.1 A motivating example. The (human) Dungeon Master (DM), knowing the desired story path, intends the players to perform actions to find out about the goblins—the first plot point that will eventually lead the players to treasure. 
They generate the guidance “You notice some movements in the bushes” using theory-of-mind by inferring that the players will perform the desired actions upon hearing their words. . . . . . . . . . . . . . . . . . . 62 xiv 5.2 Illustration of IDM. We collect 2.5k human labels on guidance and train an IDM labeler to generate pseudo labels for unlabeled large corpus. . . . . . . . . . . . . . . . . . . . . . . 65 5.3 Sketches of the three categories of methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.4 Illustration of intent modeling. We first mine intents from LLM and then train an intent generator to generate intent as additional context to train the DM model. . . . . . . . . . . 70 5.5 Illustration of our ToM-Inspired RL by using a reward function to help DM model anticipate what the players might do upon hearing the generated guidance. We give the model a reward if the predicted player action matches the intent given. . . . . . . . . . . . 71 5.6 Results comparing implicit and explicit intent models. We observe models with intent generate dramatically more guidance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.7 Human Evaluation comparing non-ToM and ToM models with mined (Left) and generated (Right) intents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.8 Summary of performance on different evaluation aspects from the BEST 1) implicit intent model (IDM-Label 41k, 2) explicit intent model (Mined Intent), and 3) intent with ToM-inspired RL (ToM-RL Generated Intent). . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.9 Inference collection collecting guidance labels. . . . . . . . . . . . . . . . . . . . . . 83 5.10 Inference collection collecting guidance labels. . . . . . . . . . . . . . . . . . . . . . 84 5.11 Evaluation interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.1 Self-Discover guides LLMs to self-discover and compose atomic reasoning modules into a reasoning structure to solve challenging tasks. Through testing on challenging reasoning benchmarks including Big Bench-Hard (BBH), agent reasoning (T4D), and MATH, we find that Self-Discover outperforms Direct Answering on 23/25 and CoT on 21/25 tasks in zero-shot setting using PaLM 2-L. . . . . . . . . . . . . . . . . . . . . . . . . 87 6.2 Illustration of using Self-Discover for problem-solving. Given a generative LM, task, and seed reasoning module descriptions, we guide LMs to generate a reasoning structure in key-value format to solve the task. Finally, models can follow the selfdiscovered structures to solve the every instance from the task by filling in the values in JSON step-by-step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.3 Illustration of three actions of Self-Discover. We use LMs to compose a coherent reasoning structure by selecting relevant modules, adapting to task-specific descriptions, and implement a reasoning structure in JSON. . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.4 Meta-Prompts for the three actions of Self-Discover. Each meta-prompt consists of an instruction in the beginning and the end, reasoning module descriptions, and task examples without labels. For IMPLEMENT, to show model an example of a reasoning structure (plan), we present a human-written structure in JSON for another task. . . . . . . . . . . . . . . . . . . . . 92 xv 6.5 Breakdown of Self-Discover performance improvement on 4 categories on PaLM 2-L. 
Self-Discover performs the best on tasks requiring world knowledge. . . . . . 99 6.6 Comparison of accuracy with number of inference calls required per instance. For CoT-Self-Consistency, we sample 10 times. Best of each RM method requires gold labels (*). Self-Discover requires only 1 inference call per instance (plus 3 more metaprompts on the task-level), same as Direct and CoT while reaching better performance compared with 40x more call required methods (majority voting of each RM) on GPT-4. We acknowledge that Self-Discover input and output are longer than CoT and Direct prompting, increasing cost. However, as the number of instances increases, the efficiency of Self-Discover in terms of inference per instance is highly desirable. . . . . . . . . . . . 100 6.7 Examples of self-discovered structures on BBH tasks using PaLM 2-L. We observe traits of atomic reasoning modules such as “step-by-step thinking”, “reflect on task nature”, and an interesting creative thinking case where models devise an algorithm using stack to solve parenthesis parsing task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.8 Comparison of generated reasoning process from CoT, Plan-and-Solve, and Self-Discover on BBH-geometric shape task. Both CoT and Plan-and-Solve incorrectly asserts that the path does not form a regular shape as it is not a closed path (highlighted in red) and arrive at a wrong answer. The reasoning structure (in blue Courier font) from Self-Discover first breaks down each line segment and analyze the coordinates carefully, then leverages logical reasoning to conclude that it forms a closed shape as the path ends at the same coordinate (highlighted in purple and orange), and selects the correct answer through final reasoning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.9 Ablation study on three Self-Discover actions on 4 reasoning tasks: all three actions are beneficial for task-solving. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.10 Transferrability tests of optimized prompts (OPRO) and composed structures (Self-Discover). The results shown are from GPT-4 using the prompts and structures optimized or composed using PaLM 2-L. We find that self-discovered reasoning structure transfers more robustly than optimized prompts. . . . . . . . . . . . . . . . . . . . . . . . . 105 6.11 Case study of human-written structure shares commonalities with LLM-discovered reasoning structure. We observe similar reasoning patterns–both structures contain step-wise analysis of each instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 xvi Abstract Natural language communication is central to all human activities and intelligence in nature. A truly helpful and natural AI agent should first and foremost be a capable communicative agent as it enables effective collaboration with humans and provides enormous social utility. Human communication strives to achieve common ground, consisting of mutual beliefs and common knowledge. This dissertation explores the role of common ground reasoning in developing communicative agents capable of rich and effective interactions. While recent advancements in large language models (LLMs) have demonstrated impressive conversational abilities, significant challenges remain in enabling AI agents to understand implicit intents, reason about shared knowledge, and achieve communicative goals. 
This work argues that equipping agents with the capacity for common ground reasoning is crucial to addressing these challenges. I investigate three key areas in this thesis: (1) Enhancing conversational models with common sense knowledge: this part explores how integrating external knowledge bases and developing novel methods for implicit knowledge representation can improve an agent's understanding of conversational context and unstated information. (2) Modeling theory-of-mind capabilities to build goal-driven dialogue agents: this part delves into equipping agents with the ability to model the mental states of their conversational partners, enabling them to reason about beliefs, desires, and intentions, ultimately leading to more effective goal-oriented dialogue. (3) Eliciting metacognition by planning reasoning strategies for diverse scenarios: this part examines how agents can dynamically adapt their strategies based on the specific context and goals, demonstrating metacognitive abilities crucial for navigating complex communicative scenarios.

This work contributes towards building communicative agents that can effectively collaborate and interact with humans in natural and intuitive ways by pushing the boundaries of common ground reasoning in AI. It lays the groundwork for future research in areas such as lifelong learning, interactive learning from super-human teachers, and ensuring the safety and reliability of deployed communicative agents.

Chapter 1: Introduction

1.1 Why Do We Need Communicative AI Agents?

Communication is a defining characteristic of human intelligence that is essential for establishing and maintaining our social world. We have shared stories through verbal and non-verbal communication since ancient times, and communication has enabled the collaboration that allowed humanity to achieve some of the most significant advances in its history. It is thus natural to build Artificial Intelligence (AI) that can smoothly and effectively communicate with humans. Communication capabilities are crucial both for AI's social utility, in everyday conversations and empathy-requiring scenarios, and for its practical utility, when AI assists us in work or autonomously completes tasks given instructions.

AI has made remarkable strides in Natural Language Processing (NLP) in recent years with the advent of language models (LM). These models, trained on vast corpora of text data, are capable of generating human-like text, answering questions, and even engaging in dialogue. However, despite their impressive linguistic capabilities, current language models still fall short of being truly capable communicative agents. One key deficiency is that language models lack the ability to effectively establish and reason about common ground (CG) - the shared knowledge, beliefs, and assumptions that form the basis for smooth communication between humans. Without common ground reasoning, interactions with language models often feel unnatural, as the models struggle to grasp context, resolve ambiguity, infer intent, and engage in the kind of perspective-taking and collaborative problem-solving that characterize human communication.

1.2 How Do We Communicate and What Is Common Ground?

Human communication is a collaborative effort [52, 2, 15] where participants strive to achieve common ground (CG), consisting of mutual beliefs and common knowledge [134, 27, 26] that remain implicit during communication.
Conversational AI systems, while able to produce fluent texts, often generate generic and dull dialogue responses [125, 174], potentially because they do not explicitly model CG in communication. Specifically, existing models mostly follow a dialogue history → response training paradigm, since such data can be easily obtained in the wild, skipping an important middle step that builds common ground, which naturally and universally exists in human communication, i.e., dialogue history → common ground → response. Moreover, the same history can yield numerous responses, predicated on the CG and intent of the responder. We conjecture that the omission of explicit CG modeling is a crucial bottleneck in response generation (RG) models because they are directly trained to produce responses without learning how and why those responses are uttered.

In addition, humans communicate with a goal in mind and use language to reach the goal by interacting with their communication partners grounded in a shared environment [52, 2, 27, 26]. To make sure the goal is reached, we often anticipate in advance how our partners will respond, in order to steer the conversation in the desired direction. This ability to reason about the mental states of conversation partners – i.e., theory-of-mind [ToM; 104] – is key to smooth and efficient communication [101, 56]. Most existing dialogue agents, while able to produce human-like responses, often do not model communicative intents or ToM explicitly. In this thesis, we investigate whether models benefit from explicitly incorporating intents and ToM in NLG.

1.3 Thesis Statement and Contributions

In summary of the previous sections: in the evolving landscape of artificial intelligence, communicative agents have become increasingly pivotal in shaping the interaction between humans and machines. The ability of these agents to understand and generate human language has experienced significant advancements through the development of sophisticated language models. However, despite these technological strides, a critical gap remains apparent: language models have not yet reached the status of fully capable communicative agents. At the heart of this limitation is the models' inability to engage in the nuanced, context-dependent, and deeply cooperative process that characterizes human communication. This dissertation asserts that the essence of effective human communication lies in the ability to establish and leverage common ground—a shared base of knowledge and assumptions that enables smooth and efficient exchanges.

In this dissertation, I argue that for language models to progress from merely capable language generators to competent communicative agents, they must be imbued with the ability to reason about common ground. I propose that this can be achieved through three key avenues:

1. Incorporating common sense knowledge: Models need to be equipped with a broad base of everyday knowledge about how the world works in order to interpret language in context and engage in meaningful exchanges. Integrating common sense reasoning capabilities will allow models to make reasonable inferences and have more substantive conversations.

2. Modeling theory-of-mind: Smooth communication requires the ability to model the beliefs, intentions, and mental states of one's interlocutor. By giving language models the ability to reason about the other party's knowledge, goals, and likely interpretations, they can tailor their responses more appropriately and collaboratively work towards mutual understanding.

3.
Planning reasoning strategies: Different communicative scenarios call for different approaches to establishing common ground. Sometimes direct questions are appropriate; other times, more subtle techniques like analogy or reframing are needed. I propose equipping models with the ability to strategically plan their communicative moves in light of the scenario and goals at hand.

Through this three-pronged approach of knowledge integration, theory-of-mind modeling, and reasoning strategy planning, I aim to demonstrate how language models can be elevated to the status of true communicative agents. By establishing common ground through active reasoning rather than just passive pattern matching, models will be able to engage in more robust, meaningful dialogues and better assist humans with complex communicative tasks. The implications span areas as diverse as education, mental health support, collaborative work, and scientific research.

In the chapters that follow, I first survey relevant work to date on language models, communicative AI, and common ground, highlighting both progress made and gaps remaining (Chapter 2). I then dive into the technical details of my proposed knowledge integration (Chapters 3 and 4), theory-of-mind (Chapter 5), and reasoning planning approaches (Chapter 6). Finally, I conclude with a discussion of remaining challenges and directions for future work in making language models into capable communicative partners for humans (Chapter 7).

Chapter 2: Background

This chapter reviews some of the important research related to common ground reasoning for communication in NLP. Since the age of computing, people have been building computational models that are capable of communicating with us. In fact, the famous Turing Test [143] focuses on assessing whether an AI can hold a natural language conversation with humans that we cannot distinguish from a real person. Thus, I will first review recent studies on Conversational AI, ranging from open-domain response generation and dialogue evaluation to goal-driven grounded dialogue scenarios. Next, I will discuss three dimensions of common ground that can enhance a model's communication capabilities – commonsense reasoning, theory-of-mind (ToM), and planning. Commonsense reasoning concerns the ability of an AI system to make presumptions about the everyday world, akin to the intuitive knowledge humans possess. This is important for common ground because when humans communicate, we do not make every assumption explicit. Thus, models need to reason about common sense to communicate smoothly. Theory-of-Mind refers to the ability to infer others' mental states such as intents, knowledge, and emotions. This is especially crucial for reaching common ground in goal-driven scenarios, as a model that understands humans' intents from their perspective can collaborate much better to fulfill goals. Finally, planning is helpful in multi-round, long-horizon, and complex communication scenarios such as negotiation and collaborative problem-solving. Equipping models with the ability to strategically plan their communicative moves in light of the scenario and goals at hand is critical. I will provide background on research in each dimension in this chapter, as well as a general overview of how to use large language models (LLMs), the most popular AI models in recent years, as communicative agents.

2.1 Conversational AI

Conversational AI encompasses the technologies and methodologies used to create systems capable of engaging in human-like dialogue.
These systems range from simple rule-based chatbots to sophisticated virtual assistants powered by advanced machine learning algorithms. The primary goal of conversational AI is to facilitate natural and meaningful interactions between humans and machines, enabling a wide array of applications from customer service to personal assistants. Early conversational agents relied heavily on predefined scripts and limited natural language processing capabilities. However, recent advancements in machine learning, particularly deep learning, have significantly enhanced the ability of these systems to understand and generate human language. Despite these advancements, challenges remain in achieving truly natural interactions, especially in understanding implicit intents and maintaining coherent, contextually appropriate conversations over extended interactions. I will start by discussing open-domain response generation systems, which serve as the foundation of communicative agents that can maintain everyday conversations. Then, I will go over challenges in evaluating responses. Finally, I will introduce goal-driven conversational agents.

2.1.1 Open Domain Dialogue Response Generation

Recently, open-domain dialog systems have been modeled using end-to-end approaches, more specifically encoder-decoder architectures [131, 125, 124, 149]. Recent work has focused on fine-tuning large pre-trained transformer models [107, 173] on dialog data. Many dialog datasets have been collected with different focuses, such as incorporating knowledge [51, 35], empathy [110], task completion [19], consistency [91], personality [170], and reasoning [30] within dialog systems. There has also been work on combining a variety of datasets to exhibit multiple attributes [115]. As discussed above, recent state-of-the-art neural response generation (RG) models based on pre-trained language models (LM) mostly produce responses in an end-to-end manner [147, 172, 73], i.e., models are trained to take the history and produce a response. Since implicit knowledge is unstated in dialogue history, RG models do not explicitly learn how to reach common ground and may generate uninformative and hallucinated responses [125, 155, 115]. Knowledge-grounded RG [48, 35, 51] addresses this issue; however, most approaches require a knowledge base (KB) to retrieve knowledge for RG [180, 175, 37], which may suffer from the limited knowledge coverage of the used KBs. In Chapter 3, I will introduce our method for incorporating implicit knowledge during grounding for open-domain response generation.

2.1.2 Dialog Response Evaluation

Due to the diverse responses that a dialog system can output, referenced automatic metrics (such as BLEU, ROUGE, and perplexity) do not correlate well with human judgment of these systems [33, 76] because they calculate word overlap with a single referenced response. As a result, human evaluation has become the de facto standard for evaluating dialog systems. However, human evaluation is costly. Recently, model-based metrics with good correlation to human annotations have been proposed [171, 123, 83, 82, 138, 79]. Most metrics focus on evaluating the coherence or appropriateness of a response with respect to its dialog context. [82] identified 18 different dialog qualities such as interesting and topic depth. However, none of these metrics evaluate the intermediate grounding stage of a response, i.e., whether the response is grounded in proper knowledge and context.
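To make the single-reference problem concrete, the short script below is a toy sketch I added for illustration; it is not the evaluation code used in this thesis, and the example strings and function name are hypothetical. It scores two equally sensible replies against one reference with a ROUGE-1-style unigram F1: the reply that happens to reuse the reference's words scores well, while an equally appropriate reply phrased differently scores zero, which is exactly why single-reference overlap metrics correlate poorly with human judgment.

    # Toy unigram-F1 (ROUGE-1-style) scorer; illustrative only.
    from collections import Counter

    def unigram_f1(hypothesis: str, reference: str) -> float:
        hyp = Counter(hypothesis.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((hyp & ref).values())  # clipped unigram matches
        if overlap == 0:
            return 0.0
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    reference = "perhaps you could get her some red roses"
    candidates = [
        "maybe you should buy her some red roses",                  # paraphrase of the reference
        "how about a nice bouquet of tulips from the shop downtown",  # equally sensible, different wording
    ]
    for cand in candidates:
        print(f"{unigram_f1(cand, reference):.2f}  {cand}")
    # The second reply is perfectly appropriate but shares no words with the
    # single reference, so it scores 0.0 despite being a valid response.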
2.1.3 Goal-Driven Grounded Dialogue Agents

There is an emerging line of work studying goal-driven situated dialogues [145, 88, 5, 10, 103, 97, 4]. However, intents or ToM are rarely incorporated explicitly in developing more human-like communication agents, which we argue are centerpieces of building capable goal-driven communicative agents. CICERO [8] proposes a strategy-guided dialogue generation agent that plays Diplomacy by modeling other players' next moves. We argue that most prior work along this line (text games, Diplomacy) still covers a rather constrained set of scenarios. In Chapter 5, we use Dungeons & Dragons to study goal-driven interactions with ToM.

2.2 Commonsense Reasoning

2.2.1 Why Commonsense Reasoning

Commonsense reasoning (CSR) refers to the ability of an AI system to make presumptions about the everyday world, akin to the intuitive knowledge humans possess. This includes understanding physical properties, social norms, and causal relationships that are typically unstated but universally understood. As an integral part of common ground, CSR is crucial for enabling AI systems to interpret and respond to human communication in a way that feels natural and intuitive. Integrating commonsense knowledge into conversational models involves leveraging external knowledge bases and developing methods for implicit knowledge representation. Despite significant progress, current AI systems often struggle with tasks that require a deep understanding of unstated information and the ability to make inferences based on that knowledge. In Chapter 3, I will introduce our method for incorporating commonsense knowledge during grounding for open-domain response generation.

2.2.2 Evaluating Commonsense Reasoning

To provide background on recent advances in commonsense reasoning, here I will discuss one of the main foci in recent years: providing datasets and testbeds for deep learning models on commonsense reasoning tasks. The majority of recent commonsense reasoning benchmarks [169, 137, 14, 120] test a model's ability to choose the correct option given a context and a question; pre-trained language models have reached high performance on these benchmarks after fine-tuning. There have been many benchmarks that focus on reasoning abilities in multiple tasks such as reading comprehension [61, 168] and natural language inference [157], which involve inferences on language. Recent work also aims to probe models in these tasks to see if reasoning is actually achieved [114, 113, 184]. However, one major limitation of these benchmarks is that they lack grounding in real usage of natural language, such as everyday dialogues. In Chapter 3, we tackle the response generation problem in dialogues, with a focus on collecting commonsense-rich dialog data and evaluating the commonsense quality of model responses.

2.3 Theory-of-Mind and Pragmatics

Theory-of-Mind (ToM) is the cognitive ability to attribute mental states—such as beliefs, intents, desires, and knowledge—to oneself and others. In the context of AI, modeling Theory-of-Mind involves creating agents that can infer and reason about the mental states of their conversational partners. This capability is essential for building goal-driven dialogue agents that can engage in more effective and nuanced interactions. By understanding the mental states of humans, AI agents can better anticipate needs, tailor responses, and achieve communicative goals by reaching common ground.
Research in this area focuses on developing algorithms and models that enable AI systems to simulate and reason about the thoughts and feelings of others, a critical step towards more sophisticated and human-like interactions.

2.3.1 Theory-of-Mind in Psychology

Theory-of-mind has been studied extensively in psychology and cognitive science [104, 11, 43], and clinical psychology tests such as the False Belief Test (FBT) [158] were developed to test ToM abilities in children. The Rational Speech Act (RSA) framework studies pragmatics between speakers and listeners from a probabilistic perspective [41, 50]. [126] has shown that teaching by simulating the student increases effectiveness. Recent work has looked into ToM and pragmatics as an essential aspect of language usage [89, 71, 105, 42, 119], especially communication [189, 10], making ToM a central piece of reaching common ground during interaction.

2.3.2 Language Models and Theory-of-Mind

New studies aim to answer whether language models (LM) exhibit ToM [119, 68, 144, 141] using False-Belief-templated story datasets such as ToMi [71]. Though such stories cover a limited range of interactions, other sources of ToM tests also face challenges, such as the high cost of human-generated interactions [10] and noise in text-game environments [187]. Specifically, some probing work shows that LLMs display some degree of ToM but lack robustness [119, 127]. More recently, [45] proposes BigToM, which formalizes probes using causal templates and tests models' inferences about others' beliefs and actions. Other works have investigated ToM for strategic planning [59, 8, 46, 31]. In Chapter 5, we study ToM in a teacher-student setting in a grounded and goal-driven environment and develop reinforcement learning algorithms to enhance language models' capability to anticipate others' intents.

2.4 Large Language Models and Prompting

This section discusses the most widely used backbones for building communicative agents: Large Language Models (LLMs). LLMs represent a significant leap forward in natural language processing. These models, such as OpenAI's GPT-3 and similar architectures, are trained on vast amounts of text data and are capable of generating human-like text based on the input they receive. LLMs have demonstrated impressive conversational abilities, including context understanding, text generation, and even some degree of commonsense reasoning. However, they also face limitations, particularly in understanding implicit intents and reasoning about shared knowledge. The integration of common ground reasoning into LLMs is seen as a crucial advancement for developing more effective communicative agents. By enhancing LLMs with the ability to reason about mutual beliefs and common knowledge, we can push the boundaries of what these models can achieve in terms of natural and intuitive human-AI interactions. I will first discuss existing efforts on building LLM agents, then prompting as an effective way to boost reasoning, and finally new advances in planning.

2.4.1 Large Language Models and Agents

A line of recent work aims to build language agents [80] that can perform "actions". Actions range from mimicking human social behavior [99] and completing tasks on websites [53] to tool use [164, 121].
2.4.2 Prompting Methods

Recent advancements in the area of LLMs have given rise to a plethora of few-shot [17] and instruction [86, 152, 96] prompting techniques, including Chain-of-Thought prompting (CoT) [92, 153], Least-to-most prompting [179, 36], Decomposed prompting [65], Reframing [85], Help Me Think prompting [87], Step-Back prompting [177], and search-based approaches like Tree-of-Thought (ToT) [163], Graph-of-Thought [12, 165], Branch-Solve-Merge [117], and RAP [55]. Each of these prompting methods has strengths and weaknesses in terms of the application domains where it succeeds.

2.4.3 Reasoning and Planning

With the development of various reasoning and planning benchmarks such as GSM8K [28], MATH [58], and BigBench [133], various methods have been proposed to improve model performance. Often these methods induce specific reasoning structures mimicking the reasoning structure of the underlying task associated with the dataset. For example, chain of thought [153] and scratchpad [92] induce the generation of explanations associated with a reasoning question. Similarly, other methods induce specific reasoning structures such as question summarization [69], question decomposition [100], program generation [84, 23, 47], etc. However, in real-world user traffic, queries are diverse and cover various reasoning structures. In Chapter 6, we investigate and improve LLMs' reasoning capabilities on diverse tasks by prompting them to self-discover a reasoning structure.

In summary, this dissertation builds on these foundational areas to explore the role of common ground reasoning in developing communicative agents capable of rich and effective interactions with humans. By addressing the challenges and leveraging the strengths of current technologies, this work aims to advance the field of AI towards more natural and meaningful human-machine communication.

Chapter 3: Improving Language Model Conversational Capabilities by Leveraging Common Sense Knowledge

3.1 Introduction

Human communication strives to achieve common ground, consisting of mutual beliefs and common knowledge [134, 27]. Such common ground depends not only on utterances but also on implicit knowledge. For example, in Figure 3.1, this common ground includes the relevant implicit background knowledge "rose is a type of flower". Integrating such common ground in utterances is an implicit process often referred to as knowledge grounding [26]. Recent state-of-the-art neural response generation (RG) models based on pre-trained language models (LM) mostly produce responses in an end-to-end manner [147, 172, 73], i.e., models are trained to take the history and produce a response. Since implicit knowledge is unstated in dialogue history, RG models do not explicitly learn knowledge grounding and may generate uninformative and hallucinated responses [125, 155, 115]. Knowledge-grounded RG [48, 35, 51] addresses this issue; however, most approaches require a knowledge base (KB) to retrieve knowledge for RG [180, 175, 37], which may suffer from the limited knowledge coverage of the used KBs. Some work also casts knowledge as a latent factor in generation [142, 161], which makes it hard to examine the quality of knowledge generation and how exactly RG uses the implicit knowledge, posing interpretability concerns.
Figure 3.1: A motivating example for our study. We look to train models to externalize the implicit knowledge grounding step by explicitly generating knowledge before responding.

We propose Think-Before-Speaking (TBS), an RG framework that trains the RG model to explicitly generate the implicit knowledge and use this knowledge to generate a response, inspired by inquiry-based discovery learning [18]. We argue that this decomposition brings three major benefits: 1) compared with end-to-end RG, generated knowledge augments and/or constrains RG to produce more informative responses; 2) compared with knowledge-retrieval models, explicitly generating intermediate groundings can potentially generalize to knowledge not included in KBs and synergize with the RG process; 3) explicitly generated implicit knowledge used in RG provides a faithful explanation of the response intent.

This new RG paradigm poses three main challenges: (1) how to identify implicit commonsense knowledge associated with dialogue turns for training the knowledge generation module; (2) how to represent structured knowledge in natural language (NL) for neural generative models; and (3) how to integrate knowledge and dialogues while distinguishing implicit and explicit parts in responses. To collect knowledge associated with each dialogue instance for training the TBS generative model, we propose weak supervision procedures to automatically align knowledge with each dialogue turn, rather than manually collecting human annotations, which is expensive and unscalable. This is achieved by using ConceptNet [132] as our knowledge base and different matching approaches to identify the implicit knowledge. We explore several ways to format knowledge originally represented as structured triples into natural language so that RG models can adapt to the knowledge+response generation task easily. We experiment with structured triples, triples converted to natural language, and a more colloquial question-answering format. To ensure a smooth transition between knowledge and dialogues, we consider using special symbols or prompts as separators.

To evaluate the TBS framework, we introduce new evaluation protocols to cover different aspects of the system, including response quality, knowledge quality, and how TBS models leverage generated knowledge. We conduct extensive human evaluations for different variants of our training procedure. Our experimental results show that our models produce responses that are more informative, more specific, and make more common sense compared to those from end-to-end RG models and other knowledge-augmented models such as knowledge selection. Knowledge quality analysis shows that at least 85% of generated knowledge makes sense and is relevant, and the generated novel knowledge (not in ConceptNet) also has high quality. Furthermore, our TBS model even outperforms an RG model that takes in knowledge obtained using ground-truth responses, showing that explicitly generating implicit knowledge is a promising direction for response generation in open-domain dialogue systems.
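As a concrete illustration of the two-step generation idea, below is a minimal sketch of TBS-style inference: a sequence-to-sequence model is first prompted to verbalize implicit knowledge given the dialogue history, and the generated knowledge is then inserted between separators before generating the response. The backbone checkpoint and the exact separator placement are illustrative assumptions; the elicitation prompts quoted in the code follow the ones described later in this chapter.

```python
# Minimal sketch of Think-Before-Speaking (TBS) inference under stated assumptions:
# a generic seq2seq checkpoint ("t5-small") and simple separator strings stand in
# for the actual backbone and formats used in the dissertation.
from typing import Tuple
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "t5-small"  # placeholder backbone, not the trained TBS model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def generate(prompt: str, max_new_tokens: int = 40) -> str:
    """Greedy decoding helper."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def tbs_respond(history: str) -> Tuple[str, str]:
    # Step 1: externalize implicit knowledge grounded in the dialogue history.
    knowledge = generate(
        "The following background knowledge is helpful for generating the response: "
        + history
    )
    # Step 2: condition the response on both the history and the generated knowledge.
    response = generate(
        history + " <implicit> " + knowledge + " </implicit> "
        + "Grounded on the background knowledge, what does the speaker probably say in the next response?"
    )
    return knowledge, response

if __name__ == "__main__":
    hist = "<speaker1> I need to buy some flowers for my wife."
    k, r = tbs_respond(hist)
    print("knowledge:", k)
    print("response:", r)
```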
3.2 Problem Formulation
Our TBS RG paradigm extends the traditional RG setting by incorporating an additional component of implicit knowledge in the generation process to externalize the knowledge grounding step in RG.

3.2.1 Response Generation
We follow the common dialogue response generation setup [154, 131]: given a dialogue history H (a sequence of dialogue utterances), generate an appropriate response R. Current neural RG models often frame this task as a conditional language modeling problem. Specifically, given a history (H) consisting of a sequence of n dialogue turns: X1, X2, ..., Xn (each turn refers to an utterance containing a sequence of ti tokens: xi,1, xi,2, ..., xi,ti) and a response (R) sentence Y comprised of a sequence of m tokens y1, y2, ..., ym, RG models aim to learn the conditional probability distribution by training on human dialogues:

Pθ(R|H) = ∏_{i=1}^{m} Pθ(yi | y<i, H).

We use special symbols "<implicit>" and "</implicit>" and add them between H′ and R, for example, "<speaker1> I need to buy some flowers for my wife. <implicit> rose is a type of flower </implicit> <speaker2> Perhaps you'd be interested in red roses."

Relation in ConceptNet | Relation-Converted NL | Information-Seeking QA (question | answer)
DefinedAs | is defined as | What is <concept1> defined as? | <concept1> is defined as <concept2>
DesireOf | desires | What does <concept1> desire of? | <concept1> desires <concept2>
HasA | has a | What does <concept1> have? | <concept1> has <concept2>
HasFirstSubevent | starts with | What does <concept1> start with? | <concept1> starts with <concept2>
HasLastSubevent | ends with | What does <concept1> end with? | <concept1> ends with <concept2>
HasPrerequisite | requires | What does <concept1> require? | <concept1> requires <concept2>
HasProperty | has the property | What property does <concept1> have? | <concept1> is <concept2>
HasSubevent | requires | What subevent does <concept1> have? | <concept1> has subevent of <concept2>
IsA | is a | What is <concept1>? | <concept1> is a <concept2>
MadeOf | is made of | What is <concept1> made of? | <concept1> is made of <concept2>
MotivatedByGoal | is motivated by | What is <concept1> motivated by? | <concept1> is motivated by <concept2>
NotCapableOf | is not capable of | What is <concept1> not capable of? | <concept1> is not capable of <concept2>
NotDesires | does not desire | What does <concept1> not desire? | <concept1> does not desire <concept2>
NotHasA | does not have a | What does <concept1> not have? | <concept1> does not have a <concept2>
NotHasProperty | does not have the property | What property does <concept1> not have? | <concept1> does not have <concept2>
NotIsA | is not a | What <concept1> is not? | <concept1> is not a <concept2>
NotMadeOf | is not made of | What is <concept1> not made of? | <concept1> is not made of <concept2>
PartOf | is part of | What is <concept1> a part of? | <concept1> is a part of <concept2>
RelatedTo | is related to | What is <concept1> related to? | <concept1> is related to <concept2>
SymbolOf | is a symbol of | What is <concept1> a symbol of? | <concept1> is a symbol of <concept2>
UsedFor | is used for | What is <concept1> used for? | <concept1> is used for <concept2>
AtLocation | is located at | Where is <concept1>? | <concept1> is located at <concept2>
CapableOf | is capable of | What is <concept1> capable of? | <concept1> is capable of <concept2>
Causes | causes | What does <concept1> cause? | <concept1> causes <concept2>
CausesDesire | causes the desire to | What desire does <concept1> cause? | <concept1> causes desire of <concept2>
CreatedBy | is created by | What is <concept1> created by? | <concept1> is created by <concept2>
Desires | desires | What does <concept1> desire? | <concept1> desires <concept2>
HasPainCharacter | has pain character of | What pain character does <concept1> have? | <concept1> has pain character of <concept2>
HasPainIntensity | has pain intensity of | What pain intensity does <concept1> have? | <concept1> has pain intensity of <concept2>
InheritsFrom | inherits from | What does <concept1> inherit from? | <concept1> inherits from <concept2>
InstanceOf | is an instance of | What is <concept1> an instance of? | <concept1> is an instance of <concept2>
LocatedNear | is located near | What is <concept1> located near? | <concept1> is located near <concept2>
LocationOfAction | has location of action at | What location of action does <concept1> have? | <concept1> has location of action of <concept2>
ReceivesAction | receives action of | What action does <concept1> receive? | <concept1> received action of <concept2>
Antonym | is an antonym of | What is an antonym of <concept1>? | <concept1> is an antonym of <concept2>
DerivedFrom | is derived from | What is <concept1> derived from? | <concept1> is derived from <concept2>
DistinctFrom | is distinct from | What is <concept1> distinct from? | <concept1> is distinct from <concept2>
EtymologicallyRelatedTo | is etymologically related to | What is <concept1> etymologically related to? | <concept1> is etymologically related to <concept2>
FormOf | is a form of | What is <concept1> a form of? | <concept1> is a form of <concept2>
HasContext | has context of | What context does <concept1> have? | <concept1> has context of <concept2>
SimilarTo | is similar to | What is <concept1> similar to? | <concept1> is similar to <concept2>
Synonym | is a synonym of | What is a synonym of <concept1>? | <concept1> is a synonym of <concept2>
dbpediacapital | has the capital city | What is the capital city of <concept1>? | <concept1> has capital city of <concept2>
dbpediaproduct | has product | What product does <concept1> have? | <concept1> has product of <concept2>
Table 3.1: Knowledge representation mappings.

Natural language prompts. More recent work has found that NL prompts help LMs to perform better on various downstream tasks, including natural language generation (NLG) [17, 77, 176]. Here we use NL prompts to prompt RG models to generate implicit knowledge and responses. We use "The following background knowledge is helpful for generating the response:" to elicit knowledge and "Grounded on the background knowledge, what does the speaker probably say in the next response?" to elicit the response.

3.3.4 Model Training
After constructing knowledge-aligned dialogues, each of our data instances is a sequence of tokens with three components: a dialogue history H′ fused with potential implicit knowledge after each turn, implicit knowledge (empty or non-empty) I, and a response R. We split each instance d(H′, R, I) ∈ D to first train the model to generate just the knowledge I based on H′, Pθ(I|H′), and then train it to generate R based on both I and H′, Pθ(R|H′, I). Formally, we follow the standard way of modeling Pθ in auto-regressive neural RG models and use Maximum Likelihood Estimation (MLE) to train our model to maximize Pθ(I|H′) (knowledge generation, KG) by minimizing the conditional negative log-likelihood (NLL) loss:

LKG = − ∑_{i=1}^{m} log Pθ(Zi | Z<i, H′),

where Z1, ..., Zm denote the tokens of the implicit knowledge I.

<speaker1>My last day of work was yesterday. I got so emotional driving home and I cried. <speaker2>I'm sorry to hear you were upset! Why did you have to quit? <speaker1>Im going back to school full time. I was there for more than 10 years. I will miss everyone. What is school related to? school is related to college; <speaker2>I'm sorry to hear that.
I hope you have a good college experience!
<speaker1>I am calling friends to let you know that I am moving out of state. <speaker2>What happened? Why? <speaker1>I got a new job. What is job related to? job is related to work; What is pay related to? pay is related to job; <speaker2>Oh wow! How long will you be working there?
<speaker1>There was something moving in the woods. I looked out of the hunting shack and saw movement. What is move related to? move is related to change; <speaker2>Did you change your mind?
<speaker1>I love all of you guys! This was the most special performance that I have ever had because of you all! What is happy related to? happy is related to special; <speaker2>Awww, it's always nice to see a happy couple.
<speaker1>Two years ago, I discovered that my now ex-wife was cheating on me with someone she worked with. When I first discovered it, i was angry and hurt. <speaker2>I bet you were, I did that to my husband when I was young. I thought it was the solution. What is man related to? man is related to young; What is young? young is a man; What is man related to? man is related to woman; <speaker1>Yeah, I was so angry and hurt, I thought I was going to be physically ill or something.
Table 3.7: Randomly sampled qualitative examples from TBS-hard-symbol-QA. We can find that in most cases the generated knowledge makes sense and responses are grounded in the knowledge and not generic. However, we do find some error cases, such as the last example, where the response does not use the knowledge.

The quality of generated novel knowledge is similar to that of knowledge existing in ConceptNet. COMET generates more new knowledge but the quality (both common sense and relevance) is significantly lower than TBS models. In general we find that the new knowledge is complementary to ConceptNet, not just a paraphrased version of existing triples (since in those cases the model will directly generate the ConceptNet triple). This shows a promising sign that TBS RG models can potentially generate good-quality novel knowledge labels for unseen dialogues.

Figure 3.8: Effects of noisy knowledge on response quality.

3.5.3 Performance Analysis
Most responses are knowledge grounded. To examine how TBS methods leverage knowledge for RG, we also present annotators a history, generated knowledge, and generated response, and ask them whether the knowledge is used in the response. We find that around 77% of generated knowledge is used in the generated response, i.e., the response is grounded in the knowledge generated from TBS.
Noisy knowledge heavily impacts quality. To better showcase the connection between knowledge and response, we examine how the quality of knowledge generated from TBS methods can affect response quality. During inference, we randomly sample noisy knowledge from another dialogue, feed it to the model to generate a response conditioned on irrelevant knowledge, and compare the response quality with responses generated from TBS knowledge. Fig 3.8 shows that there is a statistically significant (p ≤ 0.05) drop in response quality in four dimensions. This indicates that the quality of knowledge input heavily influences response quality and that TBS models generate better responses because of their decent knowledge quality.
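A minimal sketch of the noisy-knowledge probe described above: knowledge statements are shuffled across dialogues, with a check that no example keeps its own aligned knowledge, and the mismatched pairs are then fed to the response generator for comparison. The field names and toy data are illustrative assumptions, not the actual knowledge-aligned corpus.

```python
import random

# Toy stand-ins for (history, aligned knowledge) pairs; in the actual study these
# come from the knowledge-aligned dialogues used to train TBS. Assumes >= 2 examples.
examples = [
    {"history": "<speaker1> I need to buy some flowers for my wife.",
     "knowledge": "rose is a type of flower"},
    {"history": "<speaker1> My last day of work was yesterday.",
     "knowledge": "school is related to college"},
    {"history": "<speaker1> There was something moving in the woods.",
     "knowledge": "move is related to change"},
]

def assign_noisy_knowledge(data, seed=0):
    """Pair each dialogue with knowledge sampled from a *different* dialogue."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    while True:
        shuffled = indices[:]
        rng.shuffle(shuffled)
        if all(i != j for i, j in zip(indices, shuffled)):  # no self-assignment
            break
    return [
        {"history": data[i]["history"], "knowledge": data[j]["knowledge"]}
        for i, j in zip(indices, shuffled)
    ]

for ex in assign_noisy_knowledge(examples):
    print(ex["history"], "||", ex["knowledge"])
```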
31 Qualitative examples and limitations We show several qualitative examples from different models and human responses in Table 3.6. We find that TBS generates relevant knowledge and responses grounded properly in that knowledge, whereas KS/KG models retrieve noisy knowledge and Hard-GT generates response not grounded in knowledge. Here we present a summary of error patterns of TBS models and discuss potential directions to improve. More examples can be found in Table 4.4. First, our matching procedures do not concern multi-hop triples that might be needed for complex reasoning chains. Second, ConceptNet mostly contains taxonomic and lexical knowledge (“RelatedTo, IsA, etc”), limiting the diversity of generated knowledge from TBS models. We plan to explore other knowledge resources such as ATOMIC2020 [63] in the future. Third, currently the model always generates implicit knowledge. In future work, we are interested in training RG models that understand when implicit knowledge is needed based on the dialogue context. 3.6 Conclusion Inspired by how humans contribute to the common ground during communication, We propose to train RG models that explicitly generate implicit knowledge and then respond (TBS). This brings us three main benefits compared with prior end-to-end RG models: 1) more informative and coherent responses by augmenting with knowledge; 2) generated knowledge provides faithful explanations of RG model’s innerworkings; 3) models do not rely on external knowledge bases in response generation time. We first identify implicit knowledge in dialogues, explore different knowledge representation and transition choices, and demonstrate promising results compared with end-to-end and knowledge-grounded RG models from extensive evaluations. We find strong and promising results for TBS RG model compared with end-to-end RG. In particular, TBS can produce good quality and novel knowledge, outperform end-to-end RG models despite training on less data, and even produce better responses than RG models that take ground-truth 32 knowledge. We hope our findings encourage more future studies on making RG models better emulate human communication process and produce better-quality responses. 3.7 Ethics and Broader Impact Our work aims to train RG models that explicitly generate implicit knowledge before responding. [128] have found biases in DialoGPT (our base model) responses and [81] have found representational harms in common sense resources. We acknowledge that the generated responses from our models might contain biases. All of the dialogue datasets and models are in English, which benefits English speakers more. We have conducted human evaluation using Amazon Mechanical Turks. We pay turkers around $15 per hour, well above the highest state minimum wage and engage in constructive discussions if they have concerns about the process. We also give each annotation instance enough time so that we do not pressure annotators. 33 Chapter 4 Inference-Based Diverse Common Ground Further Improves Dialogue Response Quality 4.1 Introduction Human communication is a collaborative effort [52, 2, 15] where participants strive to achieve common ground (CG), consisting of mutual beliefs and common knowledge [134, 27, 26]. Conversational AI systems, while able to produce fluent texts, often generate generic and dull dialogue responses [125, 174], potentially because they do not explicitly model CG in communication (as illustrated in Figure 5.1). 
Specifically, existing models mostly follow a dialogue history → response training paradigm since such data can be easily obtained in the wild, skipping an important middle step that builds common ground, which naturally and universally exists in human communication, i.e., dialogue history → common ground → response. Moreover, the same history can yield numerous responses, predicated on the CG and intent of the responder. We conjecture that the omission of modeling CG explicitly is a crucial bottleneck in RG models because they are directly trained to produce responses without learning how and why those responses are uttered.

Modeling common ground between speakers, however, is challenging due to its implicit and subjective nature during conversations [27]. Prior work on representing CG either mines noisy commonsense knowledge triples between dialogue history and existing responses [183] or collects human inferences after reading the whole dialogue as a bystander [49]. Such approaches provide useful augmentation, but post-hoc analysis cannot mirror the generative process and intent of diverse human dialogue. Figure 4.2 illustrates three paradigms for RG. We argue that truly modeling this generative process requires (1) articulating CG prior to the response; (2) generating responses conditioned on CG; (3) differentiating response generation based on different types of CG.

Figure 4.1: A motivating example. We aim to help RG models produce more human-like responses instead of generic ones. We argue that integrating common ground by making inferences is crucial.

To this end, we formalize common ground in dialogues as inferences made by one participant to approximate potential beliefs shared by other participants, as shown in Figure 5.1. In this work, we instantiate inferences as question-answer (QA) pairs in natural language (NL) such as "What might happen later?" "They might need to clean the floor" to elicit others' beliefs, inspired by inquiry-based dialogic learning [18, 54, 156]. Another critical aspect of CG is its multi-dimensional nature, i.e., given the same dialogue context, different plausible inferences can be made, which then lead to different responses. Following these principles, we create a novel dialogue resource with multiple explicitly human-annotated common ground, each of which is further substantiated as a next-turn response continuing the conversation (an example of expanded CG and responses for one context is shown in Figure 4.3). We design a two-stage data collection process by first asking crowdsourcing workers to answer different inference questions eliciting beliefs about CG (e.g., what is the speaker feeling right now?). Answers rely on common sense and adopt the point of view of the conversational respondent. We use these QA pairs to approximate various (non-exhaustive) inference dimensions to extend the common ground (e.g., empathy and event causality). Our second step converts these CG into dialogue responses by asking different workers to write a coherent response based on the answer/inference collected in the first stage.
Our collected data Reflect contains 9k diverse responses from 600 dialogue contexts, based on 5 inference dimensions for CG. Using Reflect, we first test our hypothesis that explicitly modeling CG and using CG to construct responses creates more engaging conversations. We conduct human evaluation to compare the quality of responses between Reflect and “reflex” style datasets and models in terms of sensibility, specificity, and interestingness. We find that, compared to reflex-prone human-written and machine-generated dialogues, our two-stage data collection process results in more responses that are sensible, specific, and interesting as rated by humans. This highlights limitations of existing data collection procedures and models trained on the data. 36 Diverse Human-like RG Post-Hoc Annotations Dialogue History Response Common Ground Dialogue History Response Ours (Reflect) Reflex RG Process Dialogue History Response Post-Hoc (TBS, CICERO) CG Augmentation Common Ground Response Common Ground Response Common Ground Figure 4.2: Illustration of different RG approaches. Common RG does not model CG explicitly, TBS [183] and CICERO [49] post-hoc augments dialogues with CG, and we aim to follow natural human communication and first collect CG and then responses based on CG. We also factor in the diversity of plausible responses given a dialogue context that result from distinct CG. Next, we look to study the potential of explicitly modeling CG in dialogue systems to help build models that can create more engaging conversations. As a case study, we use the inference dimensions from Reflect and test two simple ways to guide RG using CG. We surprisingly find that simple approaches such as appending an inference question to the dialogue context before the response in the few-shot (FS) incontext examples (from Reflect) help GPT3-175B [17] generate almost 30% more responses that are deemed sensible, specific, and interesting than vanilla FS learning GPT3 (no inference question). We demonstrate that, when prompted to “think” about an inference question (approximated CG), large models such as GPT3 can create more engaging conversations. We also find that such effect is only shown in large models like GPT-3 as we find BlenderBot-440M [115] benefits from fine-tuning on Reflect, but appending inference questions does not further increase response quality. In summary, our contributions are as follows: 1) we operationalize theories about common ground and formalize them for dialogue; 2) we collect the first large-scale (9k responses) dialogue dataset with diverse responses guided by CG and release this resource to facilitate training and evaluation; 3) we show 37 important limitations of existing dialogue data and RG models that detract from engaging communication; 4) we demonstrate that CG can dramatically improve RG quality even with simple prompting, boosting quality by 30%. The resources and results from this work promise to enable the research community to create and evaluate common ground-aware RG models. 4.2 Inference-Based Common Ground We formally introduce the notion of common ground in conversations as the implicit variable conditioned on dialogue history and provides conditions to the next-turn response. 4.2.1 Grounding in Communication Successful collaborative communication activity relies on mutual understanding of shared knowledge and beliefs [26, 15] called common ground. 
However, due to least collaborative effort [52, 27] where communication participants try to minimize the effort spent on contributing to the interaction, establishing CG relies on signals other than the surface communication information (i.e., actual utterances in a conversation). While humans in face-to-face communication receive some information from non-verbal signals such as gestures and facial expressions, virtual systems such as chatbots often do not have access to such signals. Thus, we argue that they have to rely heavily on another crucial way of getting signals for establishing CG: making inferences based on the surface communication utterances and common sense, in order to approximate two humans talking to create engaging conversations. Furthermore, building CG by making relevant inferences also connects closely with the “dual process” theories of human reasoning [135, 38, 64]. We argue that the “reflexive” RG is mostly modeling “System 1” which is intuitive and associative, but a more deliberative and logical “System 2” is lacking. 38 … Stage 1: … Inference Collectio n CommonsenseFocused Partial Dialogues Jordan: How'd you do on the exam? Friend: I failed, so did everyone else I asked Jordan: I got an A and I think if my friends studied harder they could too Friend: Yeah, I don't know about that but I'll definitely study more for the next one How would you describe Jordan? Arrogant What might have happened before? Jordan worked hard studying Stage 2: Response Collection “Since you're so smart why don't you just go home and spend more time studying.” … … … … “Look at the grin on your face. Don’t let pride consume you, Jordan!” “Failing the test just bums me out so much.” “I'm glad you aced it but you're the first person I've heard say it was easy.” “Since you're so smart why don't you just go home and spend more time studying.” “I'm glad you aced it but you're the first person I've heard say it was easy.” “I'm glad you aced it but you're the first person I've heard say it was easy.” “Look at the grin on your face. Don’t let pride consume you, Jordan!” “Look at the grin on your face. Don’t let pride consume you!” “Failing the test just bums me out so much.” “Failing the test just bums me out so much.” “I imagine you pulled an all-nighter before the exam?” What might happen after? Jordan's friends maybe mad at him and not want to speak to him for awhile. What is Jordan feeling now? Proud What is Friend feeling now? Sad “Since you're so smart why don't you just go home and spend more time studying.” Figure 4.3: Reflect collection procedure illustration. We first collect CG materialized as inferences expressed in QA along different dimensions. Then for each QA pair, we collect multiple responses. 4.2.2 Formulating CG in Dialogue Consider three high-level components in communication efforts: context C (often materialized as dialogue history consisting of a sequence of n contributions C = c1, ..., cn), common ground G, and a new contribution continuing the context (often referred to as a “response” cn+1. Specifically, for common ground G, we focus on signals gained from inferences and thus materialize G as a list of m potential inferences G = I1, ..., Im conditioned on the context. We furthermore materialize each inference as a QA pair in NL Ij = (Qj , Aj ) (examples included in Figure 4.3 between Stage 1 and 2). 
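The formulation above maps naturally onto a small data structure. The sketch below, with illustrative field names that are not taken from the released resource, spells out a context C as a list of contributions, common ground G as a list of QA-style inferences Ij = (Qj, Aj), and a candidate next contribution written with one inference in mind.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Inference:
    """One inference I_j = (Q_j, A_j) approximating common ground."""
    question: str   # e.g., "What might happen after?"
    answer: str     # e.g., "They might need to clean the floor."

@dataclass
class DialogueContext:
    """Context C = c_1, ..., c_n (one string per contribution/turn)."""
    turns: List[str] = field(default_factory=list)

@dataclass
class GroundedResponse:
    """A next contribution c_{n+1} written with a specific inference in mind."""
    inference: Inference
    response: str

# A tiny illustrative instance (not drawn from the released data).
context = DialogueContext(turns=[
    "A: Oh no, I spilled the food I prepared for dinner.",
])
common_ground = [
    Inference("What is Speaker feeling now?", "Upset and embarrassed."),
    Inference("What might happen after?", "They might need to clean the floor."),
]
candidate = GroundedResponse(
    inference=common_ground[1],
    response="Don't worry! Let's clean it up and order from your favorite pasta place.",
)
print(candidate)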
We use QA format to express inferences to mimic inquiry-based dialogic learning [18, 54, 156] and follow empirical evidence that neural models take in QA-format knowledge effectively [130, 183]. 4.3 Collecting Reflect Data Here we describe how we collect Reflect, a novel large-scale dialogue dataset with diverse human-annotated inference-based CG and grounded responses. An overview of the procedure with examples are shown in 39 Inference Dimensions Inference Questions Attributes of Speaker How would you describe Speaker? Potential prerequisites What might have happened before? Potential consequences What might happen after? Speaker Emotion States What is Speaker feeling now? Responder Emotion States What is Responder feeling now? Table 4.1: Inference dimensions and corresponding questions Figure 4.3. We first select base dialogues from a dataset that is constructed without considering CG and only has one plausible response for each context (4.3.1). Then we aim to expand and collect multiple responses based on different inference dimensions. We introduce a two-stage process to first crowdsource potential inferences people make in conversations (4.3.2) and then ask a second cohort of workers to generate diverse responses based on the inferences (4.3.3). We designed a two-stage data collection to 1) collect multiple, diverse responses based on each CG; 2) to allow response writers to validate CG as high quality, generic common sense inferences. Finally, we include discussions of data quality assurance (4.3.5). 4.3.1 Pre-Collection: Selecting Base Dialogue Turns for Expansion Our first step is to select base dialogues and dialogue turns to expand on, in terms of both inferencebased CG and more potential responses following the CG. One important criterion for base turns is that they should not be “social glue” turns such as “You are welcome” in responding to “Thank you!” We aim at expanding turns that have semantically-rich dialogue context, enabling different plausible inferences to be made. After investigation of existing datasets, we use dialogues from Commonsense-Focused Dialogues [182] that are converted to dialogues from SocialIQA [120] contexts. We chose this dialogue data because SocialIQA (crowdsourced from ATOMIC [118], an if-then inferential commonsense knowledge base) contains everyday situations where people can make various inferences on. Then, to select what turns to expand on, we use simple heuristics and select the turn that has the largest semantic overlap with the event in SocialIQA using SentenceBERT [112]. 40 4.3.2 Stage 1. Inference Collection Our first goal is to collect potential inferences people might make (e.g. “they might be feeling bad”) given conversation contexts C to approximate common ground. Each inference Ij is further materialized as a QA pair (Qj , Aj ) along multiple inference dimensions as formulated in Section 4.2.2. Inference Knowledge Schema We adopt inference dimensions from ATOMIC2020 [63] since it focuses on social commonsense inferences based on everyday scenarios. Specifically, we conduct a pilot study to choose 5 dimensions from the 14 dimensions, consolidating those that overlap (e.g., “what might happen later” and “what would others likely want to do after”) in the context of dialogues. Our final five dimensions for conversation-based inference dimensions are shown in Table 4.1. 
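Returning to the pre-collection step, here is a hedged sketch of the turn-selection heuristic described in Section 4.3.1: embed every dialogue turn and the source SocialIQA event with a SentenceBERT encoder, and pick the turn with the highest cosine similarity to the event. The specific checkpoint name is an assumption; any sentence-transformers model exposing `encode` would work the same way.

```python
# Sketch of selecting the dialogue turn with the largest semantic overlap with the
# SocialIQA event, using sentence-transformers (checkpoint name is illustrative).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_base_turn(turns, event):
    """Return the index of the turn most similar to the SocialIQA event."""
    turn_emb = encoder.encode(turns, convert_to_tensor=True)
    event_emb = encoder.encode(event, convert_to_tensor=True)
    sims = util.cos_sim(event_emb, turn_emb)[0]  # shape: (num_turns,)
    return int(sims.argmax())

turns = [
    "How'd you do on the exam?",
    "I failed, so did everyone else I asked.",
    "I got an A and I think if my friends studied harder they could too.",
]
event = "Jordan bragged about acing the exam while their friends failed."
print(select_base_turn(turns, event))
```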
Crowdsourcing Our Stage 1 crowdsourcing task is: given a dialogue context, imagine that you are participating as the responder and write answers to the 5 inference questions. We recruit a group of around 30 crowdsourcing workers from Amazon’s Mechanical Turk platform (AMT) who are native English speakers and provide detailed feedback. Specifically, after carefully reading collected inferences from pilot studies, we provide feedback to turkers by stressing on several principles to make the inferences collected more closely approximate CG, shown in Figure 4.4. 4.3.3 Stage 2. Response Collection After the first stage, we have collected 5 inferences (approximating CG) in the form of QA pairs for each dialogue context. Our next step is to collect next-turn responses given both the dialogue context and the collected inference-based CG along different dimensions. To account for diversity in responses, for each dialogue context we ask three Turkers to write a next-turn response based on each of the given inferences, yielding 15 responses for each dialogue context. Similarly to Stage 1, we communicate our collection principles to workers to improve the collected data quality (Figure 4.4). 41 Stage 1. Inference Collection Principles CommonSensical RG-Driven Non-Trivial Inferences should help with generating an engaging response ⇒ “A might need to order food”; ⇏ “spaghetti is a type of food” Inferences should NOT be directly copy pasting from context ⇒ “A is feeling upset”; ⇏ “A dropped food” Inferences should follow common sense ⇒ “A need help to clean up the mess; ⇏ “A will eat food on floor” Stage 2. Response Collection Principles Towards Engaging Convs NonParaphrastic Coherent + Grounded Responses should NOT directly paraphrase the inference such as “I think you are feeling relieved'' from “Speaker is feeling relieved'' Responses should be both coherent from context and be based on the inference Ultimately, we want responses that lead the conversations that are more enjoyable and engaging Figure 4.4: Crowdsourcing principles for two-stage collection. 4.3.4 Data Collection Details We engage in active discussions with them in the TurkerNation∗ Slack channel and provide detailed feedback after multiple rounds of pilot study to ensure the data quality. Inference Collection Here we present more detailed feedback for AMT workers on Stage 1. inference collection: First, we stress that the goal of these answers is to help with generating a response to continue the conversation instead of any inferences that might not be useful for directly generating engaging responses, such as “spaghetti is a type of food” for the example in Figure 5.1. Secondly, the answers should ∗ https://www.reddit.com/r/TurkerNation/ 42 Resources Source Makes Sense Relevant TBS [183] ConceptNet 83.7% 81.0% CICERO [49] Human 86% 96% Reflect (Ours) Human 93% 96% Table 4.2: Human evaluation on inference (CG) quality. We compare inferences from three resources and compare their sensibility and relevance to dialogue context. not be a direct copy-paste of some parts in the dialogue context as those would be trivial to collect, violate the least collaborative principle and the maxim of quantity [52], and and should not be worth making inferences over. Finally, we remind them that the inferences written should be considered as “common sense” so that the approximated CG is more likely to become shared knowledge and beliefs among the dialogue participants. Collection UI and provided examples for turkers are shown in Figures 4.12 and 4.13. 
Response Collection We specifically stress on several points to workers: 1) to collect more engaging and interesting responses, response should not directly paraphrase the inference such as “I think you are feeling relieved” from inference QA pair “What is speaker feeling now? Speaker is feeling relieved”; 2) the response should be both coherent to the dialogue context as what would be naturally uttered by the responder and based on the reactions to lead the conversation in an interesting direction; 3) Ultimately, we want responses that lead the conversations that are more enjoyable and engaging. Collection UI and provided examples for turkers are shown in Figures 4.14 and 4.15. 4.3.5 Quality Control and Analysis Quality check for Inference Collection In our second stage for response collection, we ask workers an additional question before writing a response: “do you think the shown inference answer is a valid reaction from the responder?” as a way to check the quality of the first collection stage results. We find that less than 7% (200/3000) of the inferences are deemed implausible by second stage workers and only keep the inferences where most workers agree that the inferences are plausible. 43 Quality check for Response Collection To check quality for our stage 2 response results, we randomly sampled around 5% of collected responses (500) and conduct a manual in-house check for two criteria: 1) is it a sensible continuation from the dialogue context? and 2) is the response based on the inference given? We find that around 93% of the responses are a sensible continuation and 89% are following the inferences given. Further human ratings of our collected grounded dialogue responses showing that our data improves the sensibility, specificity, and interestingness aspects compared to the base responses are included and discussed in Section 4.4. Comparison to prior work on representing CG We compare CG inferences from Reflect with TBS [183] and CICERO [49], two prior work that aims to represent CG in dialogues using either ConceptNet [132] knowledge triples or post-hoc human annotations, respectively. Note we only compare inferences (CG) since neither collects new dialogue responses grounded in the inferences, and only consider a single response per context. Comparison results on sampled 100 inferences for each resource are shown in Table 4.2 where we find that inferences in Reflect are rated as make more sense and relevant to dialogue context than the prior dataset. 4.4 Limitations of Reflex-Prone Dialogue Data and Models Most existing open-domain dialogue datasets are either crowdsourced by workers who do not have strong incentives to create engaging conversations [110, 182] or crawled from language learning websites and exams [74, 30]. Both lack explicit CG. These collection processes can fail to capture engaging human-like conversations through under-specified response criteria. Accordingly, RG models trained on these data may mimic generic patterns. This section aims to demonstrate such limitations by comparing responses from Reflect with responses from both the original dialogue dataset we expand on and models trained on the dialogue history → response regime. 44 Dimensions Positive Examples Negative Examples Sensibleness That’s too bad! Thank you. Specificity Did you spill it in the kitchen? Let me help! Do you need help? Interestingness It’s actually blessing in disguise, wanna guess why? Let’s eat something else. 
Quality (SSI) It's blessing in disguise, since I ordered extra from your favorite pasta place! All above
Table 4.3: Evaluation dimensions for RG with examples (dialogue context from Figure 5.1).

4.4.1 Human Evaluation Dimensions-SSI
We evaluate the quality of each response by head-to-head comparing across systems along several evaluation criteria. We follow the protocol used by LaMDA [139] and measure SSI: sensibleness, specificity, and interestingness. Examples of positive and negative responses are shown in Table 4.3. Our assumption is that responses that contribute to more engaging conversations should satisfy all three dimensions, and we refer to them as quality responses. We do not consider automatic metrics since they do not yet reliably replace human judgements on open-ended responses, especially for fine-grained evaluation dimensions. Specifically, a sensible response is one that is reasonable in context. A specific response is one that relates closely to the given dialogue context, instead of a generic one that can be applied in dozens of different contexts. An interesting response can "catch someone's attention or arouse their curiosity, or if it is unexpected, witty, or insightful" [139]. For more detailed instructions, please refer to [139]. Evaluation UI and provided examples for turkers are shown in Figures 4.16 and 4.17.

4.4.2 Comparing Original vs Reflect Responses
First, we compare the quality of responses in previous dialogue datasets with our Reflect responses to analyze the effects of explicitly incorporating CG in human RG. Here we present results by adopting the aforementioned evaluation protocol on human dialogues, both from the original base dialogues [182] and from our Reflect dataset, derived from the same dialogues. We sampled 300 dialogue contexts and asked 3 crowdsourcing workers to rate the three SSI criteria, using majority voting to get final scores (Fleiss kappa [40] agreement is around 0.67). We compare the original next-turn response from the contexts with a randomly sampled one from our Reflect responses.

Figure 4.5: GPT3-Few Shot Prompting Format (no inference).
Figure 4.6: GPT3-Few Shot-Inference Question Prompting Format.

Reflect contains more specific and interesting responses than original dialogues. From the human evaluation shown in Figure 4.8, we observe that our collected Reflect data consists of dialogue responses that are on average more specific (20%) and interesting (13%) than the original data, while having slightly lower sensibility (4%) ratings. One possible contributor to the lower sensibility may be the 2-stage collection, where a new worker continues dialogues constrained by a specific inference generated by another person. Specifically, when comparing the percentages of responses that satisfy all three criteria, i.e., quality responses, we find that there are substantially more (18%) such responses in Reflect than in the original data. This observation raises an interesting question: "do existing dialogue training datasets capture high quality dialogues?" Without sensible, specific, and interesting responses to learn from, RG models will necessarily be limited in the quality of their output.

Figure 4.7: Response evaluation separated by inference dimensions ("How would you describe A?", "What might have happened before?", "What might happen after?", "What is A feeling now?", "What is B feeling now?"). We find that GPT3-FS-InfQ generates better responses than humans on the potential consequences dimension while generating worse ones on attributes.

4.4.3 Comparing Reflex RG vs Reflect Data
We now compare Reflect with RG models trained on dialogue data that lacks explicit CG, i.e., models trained to directly generate an utterance given a context.

Figure 4.8: Human evaluation comparing human dialogues: original data and our collected Reflect.

Reflexive model baselines. Specifically, we consider models from two categories: medium-sized RG models pre-trained on dialogue data such as BlenderBot (440M parameters)† [115] and large-sized language models (LLM) pre-trained on general texts such as GPT3-DaVinci (175B parameters)‡ [17]. We directly use off-the-shelf Blender since it is pre-trained on dialogue data (Blender). For GPT3-175B, we apply few-shot in-context learning by providing 3 examples of dialogue context and response from existing data (GPT3-FS). We manually examine these responses to ensure their quality as demonstrating examples. Then we present a dialogue context from our test data and prompt GPT3 to generate a next-turn response.
Models with no common ground struggle. Unsurprisingly, as shown in Figure 4.9, we find a similar trend as when comparing Reflect with the original dialogue data: both BlenderBot-FT and GPT3-FS generate much fewer quality responses (53% and 38%, respectively) that satisfy all criteria, particularly on specificity. This further supports the hypothesis that RG models that learn from no-grounding dialogue responses struggle to capture what constitutes meaningful conversations.
† https://parl.ai/projects/recipes/
‡ https://beta.openai.com/docs/models/gpt-3

Figure 4.9: Averaged performance comparing Reflect with reflexive models. Predictably, RG models trained to directly output responses from dialogue context using no-common-ground responses have a large gap in response quality compared to human responses guided by inferences.
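For concreteness, the sketch below assembles a GPT3-FS style prompt as described above: a few (dialogue context, response) demonstrations followed by the test dialogue context. The framing sentence and the demonstration dialogues are illustrative assumptions; in the study the completed prompt is sent to the GPT3-DaVinci completions API.

```python
# Sketch of the few-shot (GPT3-FS) prompt format: k in-context (context, response)
# demonstrations followed by the test dialogue context. The framing text and demos
# are illustrative stand-ins, not the exact prompts used in the experiments.
DEMOS = [
    ("A: I finally finished my first marathon yesterday!\nB:",
     "That's amazing, how are your legs feeling today?"),
    ("A: My flight got cancelled and I'm stuck at the airport.\nB:",
     "Oh no, did the airline offer to rebook you on a later flight?"),
    ("A: I adopted a puppy from the shelter this weekend.\nB:",
     "Congratulations! What's their name?"),
]

def build_fs_prompt(test_context: str, demos=DEMOS) -> str:
    parts = ["The following are conversations between two speakers.\n"]
    for context, response in demos:
        parts.append(f"{context} {response}\n")
    parts.append(test_context)  # the model continues from "B:" with a response
    return "\n".join(parts)

test_context = (
    "A: One of my patients keeps forgetting to take his medicine on time.\nB:"
)
print(build_fs_prompt(test_context))
```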
We take the nonexhaustive inference dimensions we used in Reflect as case studies to see how CG could improve the quality of existing RG systems’ responses, in terms of the SSI human evaluation [139]. 4.5.1 Experiment Setup Inference-Guided reflect models We attempt to shift models from “reflexive” RG to “reflective” RG by taking into account of plausible inferences that humans use to build common ground during communication. Since both BlenderBot and GPT3 are trained to generate responses directly without integrating common ground, a non-trivial challenge is how to adapt them to use inference-based common ground before RG. Here we present our two intuitive and simple approaches. 49 Dialogue History A: One of my patients keeps forgetting to take his medicine on time. B: That’s not good, how can you help someone like that? Why did you have to quit? A: I’ve been calling him everyday to make sure he takes his medicine, I have noticed his health is improving too. Orig Response B: You’re a great nurse! Reflect Q: How would you describe A? Ans: A is kind, patient and persistent. B: You are really determined and wonderful with your patients. Not many doctors would take the time every day to do something like that. Q: What might have happened BEFORE? Ans: A noticed a patient’s health wasn’t improving with their medicine. B: What made you notice he wasn’t taking it? Did they need a refill by your notes but still have a lot left? Or did you notice they weren’t better at their last appointment? Q: What do you think might happen AFTER? Ans: A will contact a family member of the patient to help their patient remember to take meds.. B: How will you handle it going forward? Can you keep up with that, or will you see if someone else close to them can help? Q: What do you think A is feeling? Ans: Satisfied, successful and determined. B: You must be feeling really confident and strong right now. You’re a doctor who goes above and beyond for your patients. I’ve never met another who would do this. Q: What do you think B is feeling? Ans: Impressed and optimistic for A. B: Wow, that’s so amazing. I can’t believe you’ve done at all that. Most doctors would just assume it’s the patients responsibility. I am confident you’ll get him better after this. GPT3-FS-Top5 B: That is amazing. You have to be a great nurse to help people like that. B: You are a true caregiver and I’m glad you’re taking care of your patients so well. B: That’s good, once he stops forgetting you can give him back his phone and he can do it on his own. B: Well, you are a good doctor. B: That is good. GPT-FS-InfQ Q: How would you describe A? B: You are amazing! I could never do that job. Q: What might have happened BEFORE? B: That’s really good that you took the initiative to make sure he was taking his medicine Q: What do you think might happen AFTER? B: You are really dedicated to your job. I hope you get a raise soon. Q: What do you think A is feeling? B: It’s great that you are helping him out and that his health is improving. Q: What do you think B is feeling? B: I’m glad that the patient is improving, but it’s a lot of work for you. Table 4.4: Randomly sampled dialogue example to qualitatively compare responses from different resources and models. Compared to original and GPT3 top-5 responses which mostly express “You are a great nurse/doctor”, Reflect responses based on the inference QA pairs are much more specific and engaging. 
We find that GPT3-FS-InfQ, though only guided by an inference question, generates less generic responses such as "... I hope you get a raise soon."

For BlenderBot-440M, we follow the common practice of fine-tuning models to adapt to a new task format. We split our Reflect data into 60/10/30 for train/valid/test and first fine-tune BlenderBot-440M (Blender-FT) on only the collected responses to show potential benefits of training from inference-guided human responses. Then we fine-tune BlenderBot but modify the training task from outputting responses from contexts to inference-guided RG. Inspired by modular generation in dialogue RG [1, 183, 129], our training task is: given dialogue context and one of the five inference dimension questions, generate the answer as well as the response collected in Reflect (Blender-FT-InfQ, indicating that the model is given the Inference Question).

For GPT3-175B, we follow the few-shot in-context learning approach with one small addition in the input: we append the dialogue context with an inference question and ask the model to generate a response. Our pilot studies show that GPT3 tends to directly generate an answer to the question, not a next-turn response to the dialogue context, thus we format the question into a prompt for GPT3 and stress that the end goal is RG. Specifically, we append the text "Think about this when responding: " and then one of our inference questions after the dialogue context to prompt GPT3 to generate a response by reflecting on the question (GPT3-FS-InfQ). Illustrative figures for prompting GPT3 are shown in Figure 4.6.

Figure 4.10: Averaged performance comparing reflexive models before and after leveraging inference-guided Reflect data. We find that inference-based common ground prompting helps GPT3-175B significantly, even comparable to human responses from Figure 4.8. Blender also gains much improvement over its pre-trained version after fine-tuning on Reflect; however, we find no significant effect from appending inference questions.

To compare and analyze the effects of each inference dimension, we randomly sample one response for each of the five inference dimensions for GPT3-FS-InfQ and Blender-FT-InfQ and take their average. For GPT3-FS, Blender, and Blender-FT, we pick the top 5 responses generated using their default decoding strategy (beam search for GPT3 and nucleus sampling for Blender) and aggregate their evaluation results. In total, we evaluate 250 responses from each model following the procedure in Section 4.4.1.

Implementation Details. We use two base models in our paper: BlenderBot-440M and GPT3-175B. For BlenderBot, we use the ParlAI package for pre-trained modeling and fine-tuning. The format for fine-tuning BlenderBot on inference questions is: the input sequence is "<speaker1> ... <speaker2>... <speaker1>... <infq> What might have happened before?" and the output sequence is "<infa>... <speaker2> ...", where we use "<infq>", "<infa>" to indicate the start of an inference question and answer, respectively. We fine-tune BlenderBot-440M for 3 epochs with batch size 16 and set the learning rate to be 1e-06.
We perform gradient accumulation for 8 steps and gradient clipping with a max norm of 1.0, and optimize using the Adam optimizer. For decoding, we use top-p nucleus sampling [60] with temperature T (p = 0.9 and T = 0.7), and a maximum decoding length of 300 tokens. BlenderBot-440M models are mostly trained on 4 Quadro RTX 8000 GPUs and take around 9 hours. We use the OpenAI API§ to access GPT3-DaVinci (175B) and include prompting formats for GPT3-FS and GPT3-FS-InfQ in Figures 4.5 and 4.6, respectively.
§ https://beta.openai.com/playground

4.5.2 Experimental Results
Prompting GPT3 to "think" about common ground improves response quality by 30%. Figure 4.10 presents results comparing models that have no access to inference-guided Reflect data with those that do. We test the hypothesis of whether guiding RG models with inference questions about common ground helps generate more human-like responses. We find that with inferences, GPT3-FS-InfQ outperforms GPT3-FS on all evaluation dimensions. Specifically, inference-guided GPT3 produces almost 25% more specific and 30% more quality responses. Moreover, 54% quality (sensible, specific, and interesting) responses already surpasses the quality of human-written responses in the original dialogues (49%), but still lags behind Reflect (58%), as shown in Figure 4.8.
Fine-tuning Blender on Reflect generates 26% more quality responses. For BlenderBot-440M, we find that fine-tuning on inference-guided human responses from Reflect helps generate almost 50% more specific and 26% more quality responses. In contrast to GPT3, BlenderBot with inference-guided fine-tuning does not seem to improve much. We speculate that model size might play a role in how much the model is influenced by CG inferences, leaving for future work more inference-customized fine-tuning on moderate-sized models.

Figure 4.11: Percentage of quality (SSI) responses on the 5 inference dimensions. We find that GPT3-FS-InfQ generates better responses than humans on the potential consequences dimension while generating worse ones on attributes.

4.5.3 Analysis
Which inference dimension helps models the most (and which the least)? Figure 4.11 shows the percentages of quality responses separated by the inference dimension we use to prompt humans and models. Interestingly, we find that on some dimensions, GPT3-FS-InfQ can produce significantly better responses than human responses from Reflect, especially the event-based dimensions ("What might have happened before?" and "What might happen after?") and the emotion-based CG about the other speaker ("What is A (speaker1) feeling now?"). However, on "How would you describe A?", human responses grounded on this question are much better. This dimension-specific analysis provides evidence that neural models' capability to generate quality responses may depend on what types of CG we use to guide them.
Prompting GPT3-175B with complete human inferences. To show how well GPT3 can make use of complete human-annotated common ground, we further append the inference answer after the question from Reflect data and prompt GPT3 to generate a response given the fully materialized common ground. As expected, we observe further improvements in response quality, especially in specificity (15% more) and general quality (16.7% more).
This analysis shows promises to make reflect-style models produce better responses by providing quality inference answers for CG. 4.6 Conclusion We introduce Reflect, a dataset with diverse inference-grounded responses inspired by CG and communication theories. We carefully design our two-stage collection process and apply quality control. Then we demonstrate limitations of existing dialogue data and models trained on it. Finally, we present promising signs that guiding models with CG results in more engaging conversations. We hope to encourage more work on improving RG quality by looking at how humans use CG and adapt the communication process to machine learning models. Future directions include providing a ranking of inference dimensions depending on dialogue context and train models to generate responses following the most suitable dimension. Reflect also enables potential automated metrics to evaluate response since more responses per dialogue might help gauge the plausible response space given a context. 4.7 Ethics and Broader Impact We collect a new dialogue dataset in English, which benefits English speakers more. We use Amazon Mechanical Turk to recruit crowdsourcing workers and we pay workers over $15/hour on average, well above the highest state minimum wage and engage in constructive discussions if they have concerns about the process. We also give each annotation instance enough time so that we do not pressure annotators. In our quality assurance process for this dataset, we also examine potential harmful biases and aggressive 54 Figure 4.12: Inference collection UI. languages in responses and remove them in the final dataset. We also acknowledge that the generated responses from our experimented models might contain biases. 55 Figure 4.13: Inference collection examples for turkers. 56 Figure 4.14: Response collection UI. 57 Figure 4.15: Response collection examples for turkers. 58 Figure 4.16: SSI evaluation UI. 59 Figure 4.17: SSI evaluation instructions. 60 Chapter 5 Modeling Theory-of-Mind For Goal-Driven and Grounded Communication 5.1 Introduction Humans communicate with a goal in mind and use language to reach the goal by interacting with their communication partners grounded in a shared environment [52, 2, 27, 26]. To make sure the goal is reached, we often anticipate how the partners will respond in advance to steer the conversations in the desired direction. This ability to reason about the mental states of conversation partners – i.e., theory-of-mind [ToM; 104] – is key to smooth and efficient communication [101, 56]. Most existing dialogue agents, while able to produce human-like responses, often do not model communicative intents or ToM explicitly. In this paper, we investigate if models benefit from explicitly incorporating intents and ToM in NLG. To bridge the gap between human communication and existing dialogue models, we propose a new task G4C: Generating Guidance in Goal-Driven and Grounded Communication. G4C considers three building blocks: intent, guidance, and action. The task envisions a teacher with intent for specific student action, guidance uttered by the teacher, and action undertaken by the student based on the guidance and common ground. G4C evaluates the ability of a teacher to provide intentional guidance that results in intended student actions.∗ The success of the teacher’s guidance depends on whether the student’s ∗Here we use actions to indicate any linguistic behavior with intention [2]. 
61 Inside DM's (Theory of) Mind DM Intent I want the players to make a perception check to find out about the goblins to get the letter "You notice some movements in the bushes" “There might be something hiding there, let’s go take a look. Perception check!” Anticipated Player Players Action Guidance Matches Intent Shared Common Ground between the DM and Players Players were hired by a dwarf named Gundren Rockseeker to transport a wagonload of provisions to Phandalin. After a day and a half of travel, the players got onto a smaller trail not as well maintained... Information Only Available to the DM Five goblins hid in the bushes near the trail ready to attack the players. Upon defeating them, players can find a letter from one of the goblin's pockets showing that Gundren has gone missing... Figure 5.1: A motivating example. The (human) Dungeon Master (DM), knowing the desired story path, intends the players to perform actions to find out about the goblins—the first plot point that will eventually lead the players to treasure. They generate the guidance “You notice some movements in the bushes” using theory-of-mind by inferring that the players will perform the desired actions upon hearing their words. subsequent action matches the teacher’s intended action. Using this task formulation, we analyze if the teacher has fulfilled their communicative intents explicitly by examining what the student says afterward. G4C further requires the dialogue to be grounded, meaning that both the teacher and the student are communicating with a shared environment and background. To train models to perform G4C, we use Dungeons and Dragons (D&D) as our environment, a game that heavily relies on communication that is inherently goal-driven and grounded. D&D is a role-playing game consisting of multiple player characters and a Dungeon Master (DM) who collaborate to achieve a set of goals beneficial to the players. The DM, the narrator and host of the game, has an innate motivation 62 to guide the players to perform a series of actions that roughly follow a pre-devised storyline culminating in a global goal, all grounded in a shared fantasy world. An example of each component of G4C in the D&D environment (intent, guidance, and action) is shown in Figure 5.1. We construct 47k D&D dialogues from transcripts collected by [21]. Motivated by the critical roles intents and theory-of-mind (ToM) play in human communication, we study the following central research question: “Does incorporating intent and ToM make computational models better communicators?“ Accordingly, we explore different methods for modeling intent and ToM for G4C in Section 5.3. Specifically, we make the intents of the teacher (DM) explicit by mining intents from large language models (LLM) and appending them as additional context to guide generation. We further propose a method to train a DM to generate guidance for a player with RL inspired by ToM. The DM first predicts in advance what action the player will take in reaction to the guidance and then uses this prediction as a feedback reward function to check whether the predicted action matches DM intent. G4C focuses on mimicking human communication that is goal-driven and coherent to a grounded narrative, which current automated dialogue metrics do not capture well. As such, we further propose novel human and automated evaluation metrics to measure whether the output fits in the grounded context and fulfills communicative goals. 
Our experiments show that DMs trained with explicit intents and ToM to predict how their players will react to their utterances ahead of time triples the number of responses generated that are both grounded and fulfill the communicative intent. 5.2 G4C and G-Dragon Here we discuss how we construct the environment for the proposed G4C task using a dataset of dialogues from Dungeons and Dragons (D&D) called G-Dragon. We start with formulating the G4C task, then introduce the D&D data, and finally present our procedure of constructing the environment using large-scale data. 63 Character Game Dialogue DM A dwarf named Gundren Rockseeker has hired you to transport a wagonload of provisions to the rough-and-tumble settlement of Phandalin... You all notice some movements in the bushes nearby the road... Clint "There might be something hiding there, let’s go take a look." Clint makes a perception check. 16 Vi I’ll help as well. I got a 10 DM Clint, you notice a few goblins crouching in a part of the shaded woods off to the side of the road... Table 5.1: Example dialogue transcript from D&D game play. 5.2.1 G4C Task Consider three variables in communication between a teacher and a student: context C, teacher utterance T , and the subsequent student utterance S. In standard dialogue response generation (RG) setup, models are trained to generate the next utterance only based on the previous dialogue history, i.e., P(T |C) for teacher and P(S|C, T ) for the student. In our task setting, we further consider one variable: intents of the teacher: IT . † In G4C, we assume that the teacher’s intents are to guide the student to perform certain action A and the intents are fulfilled if the student’s subsequent utterance S entails A. Since we focus on verbal communication, all variables including IT and A are in natural language (NL). The teacher model’s goal is thus to first come up with an intent, i.e., P(IT |C) and then generate an utterance that helps achieve the intent, i.e., P(T |C, IT ) such that S ≈ A, given student model P(S|C, T ). 5.2.2 D&D Dialogue Generation as a Partially Observable Markov Decision Process Here we discuss a reformulation of the standard RG problem as a partially observable Markov decision process (POMDP). We consider a POMDP defined as ⟨S, A, T, R, O⟩, where S is a set of states, A is a set of actions performed by the teacher (note it is different from the player action A), T is a set of transition † Students also have intents, which are not explicitly modeled in this work. 64 Training Inverse Dynamics Model (IDM) to Provide Pseudo Labels How to Extract Guidance Labels from Noisy Raw Data? A dwarf named Gundren Rockseeker has hired you to transport a wagonload of provisions to Phandalin. After a day and a half of travel, you got onto a smaller trail not as well maintained, and bandits and outlaws have been known to lurk along the trail. You notice some movements in the bushes… Collect human labels on ~2k instances T5 Labeler Train IDM to locate guidance given DM posts Label raw 43k D&D data with IDM to train forward DM Models Figure 5.2: Illustration of IDM. We collect 2.5k human labels on guidance and train an IDM labeler to generate pseudo labels for unlabeled large corpus. probabilities between states (T(s ′ |s, a)), R is reward function, and O is a set of observations. 
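As a concrete illustration of this POMDP view, a single G4C interaction could be represented as follows; the field names and the exact-match reward are an illustrative simplification, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class G4CTurn:
    """One teacher-student interaction under the POMDP framing of G4C."""
    context: str            # dialogue history C, a partial observation of the world state
    intent: str             # teacher's intent I_T, e.g. "make the players roll perception"
    teacher_utterance: str  # DM guidance T, the teacher's action
    student_action: str     # player's subsequent ability check A, used as the reward signal

def reward(turn: G4CTurn, intended_action: str) -> int:
    # Reward 1 if the student's subsequent action matches the teacher's intent
    # (simplified here to exact match of the ability-check label), else 0.
    return int(turn.student_action == intended_action)
```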
In D&D dialogues such as Table 5.1, we consider the first DM sentence (not in bold) as the observation containing an incomplete description of world state, the second sentence in bold as the action containing guidance for players, the next player turns as reward (in this case players’ perception check‡ matches DM intent), and the final turn as new observation. 5.2.3 Play-By-Post D&D Data As introduced in Sec. 5.1, D&D satisfies two crucial aspects we investigate in G4C: goal-driven (players are motivated to finish quests guided by the DM) and groundedness (players and DM are co-located in the environment and narratives). Furthermore, the DM is constantly providing guidance to other players, matching the teacher role in G4C. We use actual play game transcript dataset from [21] scraped from Play-By-Post (PBP), a web forum§ where people play D&D by taking turns posting on the forum. PBP data contains more than 800k turns with around 58M words, annotated heuristically with game state ‡Ability check is a game mechanic that models the stochasticity in D&D. The player needs to roll a die and the number determines whether the action succeeds or not. § https://www.dndbeyond.com/forums/d-d-beyond-general/play-by-post 65 information such as player class, race, and ability checks. However, to adapt this dataset to our G4C setting, we need to filter the data to focus on interactions of DM guiding players. To use PBP data for G4C, several non-trivial challenges exist. First, posts from DM often contain many non-guidance noises such as out-of-character chitchat, rule discussion, and combat ruling. Second, DM often addresses multiple players and we focus on teacher-student 2-participant interaction in this work (we leave multi-party goal-driven dialogue to future work). Lastly, dialogues from forums are not strictly chronological, meaning that the n-th post might not be responding to the (n-1)-th post due to asynchrony. Due to the above challenges, we propose our methods to filter raw post data to get threadlike dialogues between the DM and a player that follows chronological order. We filter PBP data so that each instance contains three components: 1. context/dialogue history (C); 2. DM turn with potential guidance to a player A (DT); 3. player A action turn (PA). To get such thread-like dialogues, we first need to locate which posts contain clear player actions (as feedback to DM’s guidance). Luckily, in D&D, player actions are often clearly indicated by a game mechanic called “ability check” where the player has to roll a die to determine whether their actions such as perception or stealth succeed or not. This provides clear signals of when the players have taken action. We thus regard posts that contain players making ability checks as player action turns PA. Then we look at the previous 20 turns to find potential posts with DM guidance (DT) and context (C). We use two annotated tags from PBP the data: “name_mention” and “reply_to” to locate the DM posts that address the player who makes the ability check. If no posts have been added in the previous 20 turns, we then add the closest turn from the DM that’s not replying to another player. After getting DT, we add turns from the player or DM before the DM turn to our context C, completing a three-component thread-like dialogue instance. 66 Character Game Dialogue Explanation DM A dwarf named Gundren Rockseeker has hired you to transport a wagonload of provisions to the rough-and-tumble settlement of Phandalin... 
The DM here is providing background for the players and sets up an encounter with the goblins , who will provide players with important clues. DM You all notice some movements in the bushes nearby the road... The DM provides guidance to prompt players to check surroundings so that they can find out about the goblins Clint "There might be something hiding there, let’s go take a look." Clint makes a perception check. 16 The player is making a perception check: a game mechanic that models the stochasticity in the D&D world. The player needs to roll a die and the number determines whether the ability check succeeds or not. Vi I’ll help as well. I got a 10 DM Clint, you notice a few goblins crouching in a part of the shaded woods off to the side of the road. Two of the goblins begin charging your wagon... Roll for initiative! The Dungeon Master describes the outcome of the perception check and starts the encounter with goblins (a battle starts with players rolling for initiative which determines the order that they will take their turns) Table 5.2: Example dialogue transcript from D&D game play with explanations. 5.2.4 Creating the Environment Training a DM to generate guidance using G4C formulation requires first identifying which part of DM’s utterances contains guidance, as the DM also roleplays other characters, chitchat, or discusses rules. Creating such labels requires human-in-the-loop data collection or large offline labeled datasets, both of which are heavily resource intensive [44]. To mitigate such resource constraints, we collect human labels on a small (< 5%) portion of our dataset and then train an inverse dynamics model (IDM) that given the players’ reactions (reward R) after potential DM guidance (action A), extracts which portions of the DM’s utterance contain guidance (Figure 5.2). Given that we cast the dialogue generation in G4C as a POMDP, the forward modeling problem is to generate guidance so that the player’s feedback is as intended, such as making a perception check. Thus our inverse modeling problem can be formulated as given the next player ability check being perception check (feedback/reward), extracting the guiding sentence (DM’s action) from DMs’ utterances. IDM modeling is simpler than forward behavior cloning because it uses a non-causal formulation that exploits both past and future events to identify a guidance sentence [7]. Human Label Collection. We design our human labeling interface to contain 3 questions: 1. Does this DM turn contain guidance or not? 2. If it does, please choose a sentence from the text that serves the purpose 67 of guidance the most. 3. Imagine that you were the player, what ability check would you make? We add the third question to provide more labels to evaluate DM models (discussed in Section 6.4.3). Our designed human annotation interface for IDM labels and evaluation are included in Figures 5.9, 5.10, and 5.11. We recruit around 120 AMT annotators from English-speaking countries (US, UK, Australia, and New Zealand) since our data is in English. We first run a qualification test on 3 of our annotation questions and give qualifications to around 60 annotators who passed our test. Then we provide detailed instructions and examples to them for completing our task. We also made it clear that our data is for research purposes and annotator ID will not be disclosed in any way. Crowd working studies of standard NLP corpora (involving no personal disclosures) are not required by our IRB to be reviewed by them. IDM Training. 
In practice, we collect around 2.5k human labels on guidance and train IDM to provide labels for the large unlabeled data. We consider two subtasks for IDM: identifying whether a DM turn (DT) contains guidance and extracting the key guiding sentence (GS) from DT. We train two T5-3B models [108], one for classifying DM texts that contain guidance or not (IDM-Identify) and the other for extracting a sentence from the text (IDM-Extract). Specifically, we train two T5-3B models [108] on our collected 2.5k human labeled dialogues, one for classifying DM texts that contain guidance or not (IDM-Identify) and the other for extracting a sentence from the text (IDM-Extract). For IDM-Identify, we treat the task as a binary prediction task and trains T5 to generate either 1 (contains guidance) or 0 (non-guidance) given the raw DM turn. For IDM-Extract, which is a harder task to select one sentence from the raw DM post as the most important guidance sentence, we have explored several approaches. We tried a text rewriting formulation that trains models to generate a special symbol (*) before and after a sentence in given text and an index selection formulation where we pass in DM turn indexed (e.g., “1. A dwarf... 2. You notice some...”) and train the model to generate an index number (“2”). Empirically we find the latter performs better. 68 Explicit Intent + Theory of Mind Implicit Intent Context A dwarf named… Guidance You notice some movements in the bushes Context Explicit Intent Guidance Intent Perception Check + Context Guidance Intent + Check if Action matches intent Anticipating Player Next-Turn Action Figure 5.3: Sketches of the three categories of methods IDM Model Evaluation We evaluate the IDM labeling performance on the test split of our human labels with 3-way redundancy on each label. We also tried other baselines for IDM-Extract: 1) longest sentence; 2) last sentence; 3) 3-shot in-context learning using GPT-3 by asking them to select an index (same format as IDM); 4) encode each sentence and next-turn player action using SentenceBERT [112] and use cosine similarity to find the most similar sentence to the player action. The IDM-identify model reaches 82% accuracy on binary classification tasks and IDM-extract model reaches 70% accuracy on a 10-way classification task (random guessing 10%). The best-performing baseline is 3-shot GPT-3 with in-context learning which reaches 55%. We argue that this task is hard and subjective as human agreements are very low. However, experimental results on using IDM-generated labels (Table 5.4) shows that it provides helpful signals and outperforms training on human labels significantly. We also trained a DM model using GPT-3 labels and observe drops in performance overall. IDM Results. We evaluate IDM performance on 1k human-labeled data and compare it to baselines such as the longest sentence and GPT-3 with in-context learning. In summary, we find that trained IDM outperforms other baselines on extracting GS, reaching around 70% accuracy where random guessing is 10% (the average number of sentences in DM’s posts is around 10). 69 3. Modeling with Explicit Intents + Context Intent “The DM intends the players to make a perception check to find out about the goblins” Guidance 1. Mining Intents from LLM Action T5 Intent 2. Training Intent Generator Given Only Context Context Context Intent Guidance Figure 5.4: Illustration of intent modeling. 
We first mine intents from LLM and then train an intent generator to generate intent as additional context to train the DM model. 5.3 Theory-of-Mind Inspired Guidance Generation in Grounded Environments This section introduces our exploration of model designs to train a teacher model that can guide the student to perform certain actions by speaking in a grounded environment. We are specifically interested in the research question “Does incorporating intent (5.3.1) and theory-of-mind (5.3.2) help models generate better guidance?” 5.3.1 Modeling Intents Implicit Intent. We start with the standard RG setup in most dialogue modeling work: training models to directly generate the target utterance (guidance) given dialogue context with no explicit intent involved. Formally, we model P(T |C) using the DM text with guidance as teacher target utterance T and the context turns as C. Explicit Intent with Generator. Here we propose modeling methods that include explicit intents of the teacher IT . Following 5.2.1, we treat the teacher’s intents as additional context appended to the dialogue context, i.e., P(T |C, IT ). Figure 5.4 shows the procedure. 1. Mining Intents Using Large 70 Context …you got onto a smaller trail not as well maintained… You saw two dead horses. + Intent I want players to make a perception check so they can engage the goblins You notice some movements in the bushes You saw two guards Player Model Input: DM Guidance Output: Ability Check Stealth Generated Guidance Anticipated Player Action Medicine Perception 0 0 1 Reward Intent2Action Matching ToM-Inspired RL by Anticipating Player Action Figure 5.5: Illustration of our ToM-Inspired RL by using a reward function to help DM model anticipate what the players might do upon hearing the generated guidance. We give the model a reward if the predicted player action matches the intent given. Language Models (LLMs) Since intents are implicit in the data, we first need to mine DM’s intents from their utterances. To ensure the quality of mined intents, we use LLM such as GPT-3 to generate intents in natural language given context, guidance sentence from DM, and the next-turn player action. We prompt GPT-3¶ with “The following is a conversation that happened in a game of Dungeons and Dragons: [Context] [DM Text] [Player Name]:[Player Ability Check] Question: What do you think that the DM intentds to do by mentioning [Extracted Guiding Sentence]? Answer:” 2. Training Intent Generator Using mined intents, we train an intent generator (IG) that takes the context C as input and generates an output of the DM’s potential intent IT . In practice, we train a sequence-to-sequence model T5 [108] on 45k mined intents for our training and valid data. We also conduct a human evaluation on both mined and generated intents to examine whether these intents are reasonable given the context. Humans rate 85% of the mined intents and 75% of generated intents proper with 3-way redundancy of each intent from sampled 500 intents. 3. Modeling with Generated Intent With a trained IG, we then generate intents on our test split. Then the teacher model that takes intents as additional input will use the generated intents from IG to generate utterances during testing. ¶We use text-davinci-03 from https://beta.openai.com/docs/models/gpt-3 71 5.3.2 Modeling (Limited) Theory-of-Mind (ToM) Using RL for Guidance Generation Background and Intuition. 
Here we model a limited scope of ToM by modeling the anticipated action of the players in order to help the teacher to generate utterances that guide students to fulfill the teacher’s intents. Specifically, in G-Dragon, the DM infers what the players might do when they provide different guidance. For example, “you notice some movements in the bushes” will likely motivate the players to make a perception check while “the guard seems a bit shaken to hear your words” might prompt the players to make a persuasion check. DM then chooses the guidance that will more likely prompts players to perform the action that fulfills the goal. Training Player Model. The first step of our proposed ToM-teacher is to train a player model (PM) that takes in context and DM utterances and outputs the most likely player action (ability check), i.e., P(A|C, T ). Luckily, each instance of our G-Dragon data naturally contains training data for PM with the DM turn and next-turn player ability check. We also train a sequence-to-sequence model T5 [108] to predict the player action using our data. The trained PM reaches around 71% accuracy in predicting the actual player ability check. To get an approximate upper bound of the task, we ask humans to predict the next player action on our test set and observe only about 76% accuracy in matching with players in the data transcript. This might be due to the players actually playing the game also considering other factors when making the decisions that we do not have in our data: long-term character goal, detailed persona, player roleplaying style, etc. We argue that our player model presents a reasonable proxy of what a player might act given the context provided. Player Action-Intent Matching as Reward. With a player model approximating player reactions, we then use Reinforcement Learning (RL) to reward the DM model if it generates guidance that will lead the PM to perform an action matched with intent (Figure 5.5). Specifically, during training the Mined Intent and Generated-Intent models introduced in Section 5.3.1 to model P(T |C, IT ), we pass the model output T to the trained PM (P(A|C, T )) and get predicted player action A. Since intents are in NL, we train a 72 matching module Intent2Action to convert them to the most likely ability check such as “perception” (23 types in total), P(AT |IT ). Finally, we examine whether the predicted action from PM (A) matches with the intended action (ability check) from the DM (AT ). Finally, we give the model reward of 1 if the actions match and 0 if not. Intuitively this helps shape models to generate guidance more aligned with intents by simulating what the players might do one step ahead. 5.4 Evaluating G4C Here we propose multifaceted evaluation protocols to measure the quality of the DM/teacher model for G4C. We introduce three criteria, Fluency, Groundedness, and Goal-Fulfillment, to evaluate model outputs. We design automatic metrics and human evaluation protocols for each criterion, and analyze how well the proposed metrics correlate with human judgments in 5.5.2. We refer to outputs satisfying all three criteria as star DM. 5.4.1 Measuring Fluency We first examine whether the output text sounds natural and fluent as a DM. Automatic Metrics: Matching with References. As with most dialogue evaluation metrics, we use human-written responses as ground truth references and compare the output with them. The closer the output is to the human original response, the more fluent∥ . 
We use standard natural language generation (NLG) metrics such as BLEU [98] and ROUGE [75], etc. to measure the overlap between the output and reference. Human Evaluation. For each response, we ask three annotators to “evaluate whether the response sounds natural and fluent. If anything seems off or sounds weird—confusing, illogical, repetitive, or factually wrong—then choose No.” and use majority voting. ∥ Perplexity is also often used to measure fluency, but this measure isn’t relevant as we don’t use autoregressive LMs. 73 Model Variant Base Model Input Implicit Intent Human-Label T5-3B Context IDM-Label T5-3B Context Explicit Intent Mined Intent T5-3B Context+Intent Gen. Intent T5-3B Context+Intent Explicit Intent + ToM-Inspired RL RL+Mined Intent T5-Large Context+Intent RL+Gen. Intent T5-Large Context+Intent Table 5.3: Model variants. All targeted outputs are guidance from DM. All training data size is 41k except for human-label (2k). The test set (1k) is shared across all. 5.4.2 Measuring Groundedness G4C focuses on grounded communication, where the teacher and student share environment and background knowledge. Thus, here we focus on evaluating whether the generated output is grounded to the context of the story built by the DM and players. Automatic Metrics: Entity Matching. We design an automatic metric to measure the entity overlap between those mentioned in the context and in the generated output. Intuitively, the generated responses should not have mentions of entities not in the context, otherwise, the model is hallucinating. We use a RoBERTa-large-based [78] named entity recognizer (NER) to extract entity mentions such as person’s names and locations from both the context and the model output and calculate their overlap (the higher the better). Human Evaluation. Since groundedness also covers other aspects (narrative flow, style, etc.) than entities, we conduct a human evaluation to measure whether the response sounds like it is continuing the same story from context. For each response, we ask three annotators to “evaluate that given the conversation context, whether the response sounds like it’s continuing the same story (grounded), or beginning a new story (NOT grounded)?” 74 Dimensions Metrics Human-Label 2.5k IDM-Label 41k Random-Label 41k Fluency Human Evaluation 0.80 0.81 0.56 Groundedness Entity Matching 0.749 0.776 0.718 Human Evaluation 0.91 0.92 0.72 Goal-Fulfillment Guidance Classification 0.438 0.474 0.254 Player Action Matching 0.261 0.262 0.249 Human Evaluation - Guidance 0.21 0.23 0.20 Human Evaluation - Action Matching 0.11 0.17 0.13 Table 5.4: Results on the 3 dimensions using metrics from Section 5.4 comparing models that use IDMgenerated pseudo-labels and human-generated labels. 5.4.3 Measuring Fulfillment of Intents The core measure of the success of models for G4C is whether the goal of the teacher is fulfilled by making the response. Specifically, we want to measure, whether the generated output 1) indeed contains guidance for the student and 2) guides the student to perform the action that the teacher wants them to take (action matching). Automatic Metrics: Guidance Classifier and Player Action Matching. To evaluate whether the generated output contains any guidance, we reuse the IDM-Identify model discussed in 5.2.3 that takes the input of DM posts and predicts whether this post contains guidance or not. 
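A minimal sketch of the two automatic metrics introduced so far (entity overlap for groundedness and the reused IDM-Identify guidance classifier) is given below, assuming a generic off-the-shelf NER pipeline and treating the IDM classifier as a text-in, label-out callable; both are illustrative stand-ins rather than the exact implementation:

```python
from transformers import pipeline

# Any RoBERTa-large-based NER model can be used here; the checkpoint name is an assumption.
ner = pipeline("ner", model="Jean-Baptiste/roberta-large-ner-english",
               aggregation_strategy="simple")

def entity_overlap(context: str, response: str) -> float:
    """Groundedness proxy: fraction of response entities that also appear in the context."""
    ctx_ents = {e["word"].lower() for e in ner(context)}
    resp_ents = {e["word"].lower() for e in ner(response)}
    if not resp_ents:
        return 1.0  # no entities mentioned, so nothing can be hallucinated
    return len(resp_ents & ctx_ents) / len(resp_ents)

def contains_guidance(response: str, idm_identify) -> bool:
    """Goal-fulfillment proxy: reuse IDM-Identify as a binary guidance classifier.

    `idm_identify` is a stand-in for any callable wrapping the trained T5
    IDM-Identify model (text in, "1"/"0" out).
    """
    return idm_identify(response).strip() == "1"
```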
For action matching, since it is infeasible to collect the original players’ responses on all model outputs, we train a player model (PM) to generate potential actions given DM model outputs. Finally, we compare the predicted action with the actual player action after the human DM guidance from the dialogue transcript. The higher the percentage of matching human player action, the better the model is at generating guidance that achieves the same goal as human DM. Note that although we also train a PM for ToM modeling in 5.3.2, the PM used for evaluation is a distinct model based on a larger model and trained on the test set of the data as well. Human Evaluation. To evaluate guidance, we ask annotators: “Is this response providing guidance to the players?” For action matching, we ask crowdsourcing workers to write down the most likely ability check that they think the player will take after the given DM utterance. We also provide annotators with the player character’s race and class to better approximate the players. 75 5.5 Experimental Results We aim to answer three research questions through our experiments: 1) Do IDM-provided labels help train models that generate better guidance? 2) Does explicitly incorporating intents result in better models? 3) Does theory-of-mind modeling help models become better communicators? 5.5.1 Compared Models We use T5-3B [108] as our base model. We train a model with only 2.5k human-labeled guidance data collected in 5.2.4 (Human-Label). Then we train IDM on human labels and provide labels for the rest of the 41k unlabeled dialogues (IDM-Label). Next, we explicitly incorporate intents in modeling and consider two model variants following 5.3.1: Mined Intentthat is given intents mined from LLM using both context and next-turn player actions; Generated Intent, where the model is trained on mined intents, but during test time, we train an intent generator to provide intents without knowing future turns. Finally, following Section 5.3.2, we use a trained player model to provide reward signals for DM models for RL. We use T5-Large for RL training on top of mined intent (RL-ToM-Mined) and generated intent (RL-ToM-Gen.) models. We use RL4LMs [109] to implement the reward function and use Proximal Policy Optimization (PPO) [122] for RL training. A summary of model variants is shown in Table 5.3. 5.5.2 Correlation Analysis of Automatic Metrics Here we present correlation results of automatic metrics in Sec. 5.4 using human evaluation results (with an average inter-annotator agreement of 0.78) on our test set. For fluency, we find a statistically insignificant correlation (p-values > 0.05) between automatic metrics that measure lexical matching with a reference response. We suspect that 1) lexical matching does not reliably capture the naturalness of languages [116, 32] and 2) many plausible responses can be made given the same context [181], making comparing with 76 Fluency Groundedness Guidance Action Matching Star DM Evaluation Dimensions 0 20 40 60 80 100 Percentage 81 92 23 17 12 85 90 36 23 17 86 91 39 18 14 Comparison of Models with Implicit and Explicit Intents Implicit Intent Mined Intent Gen. Intent Figure 5.6: Results comparing implicit and explicit intent models. We observe models with intent generate dramatically more guidance. the single reference unreliable. 
For both groundedness and goal-fulfillment, we find statistically significant (p-value < 0.0001) correlations between automatic metrics (entity matching, guidance classifier, and action matching) and human judgments on test instances. Conclusion: for fluency, we will use human evaluation and for groundedness and goal-fulfillment, the automatic metrics provide a reasonable proxy. 5.5.3 Results and Analysis Do IDM-provided labels help models generate better guidance? Here we examine the effects of our inverse dynamics models on training DM models for G4C. Table 5.4 presents the results following our evaluation dimensions introduced in Section 5.4. We see that models trained using our IDM-provided labels outperform those trained on the small number of high-quality human labels on all measures. To show that data size alone is not sufficient for training a good DM model, we randomly assign labels of guiding sentences on the same number of training instances as IDM models (“Random-Label 41k”) and find the performance is significantly worse than either of the models using human or IDM labels. This shows that the quality of IDM-provided labels is critical for DM modeling and our IDM offers a scalable and affordable solution to obtain a large number of quality labels requiring only small-scale human annotation. 77 Fluency Groundedness Guidance Action Matching Star DM Evaluation Dimensions 0 20 40 60 80 100 Percentage 85 90 36 23 17 70 92 48 42 44 Human Eval Comparing non-ToM and ToM Models with Mined Intents Mined Intent (3B) ToM-RL Mined (Large) Fluency Groundedness Guidance Action Matching Star DM Evaluation Dimensions 0 20 40 60 80 100 Percentage 86 91 39 23 14 74 91 54 52 51 Human Eval Comparing non-ToM and ToM Models with Gen. Intents Gen. Intent (3B) ToM-RL Gen. (Large) Figure 5.7: Human Evaluation comparing non-ToM and ToM models with mined (Left) and generated (Right) intents. Does explicitly incorporating intents help? Figure 5.6 shows results comparing the best model with no explicit intents (IDM-Label), mined intents, and generated intents. We find that models with explicit intents perform on par on groundedness, but improve on fluency, guidance, and action matching. The improvement is especially dramatic on the Goal-Fulfillment aspect, as adding intents increases the proportion of outputs that contain guidance by more than 50% and action matching by more than 30%. We speculate that this might be due to explicit intent modeling, as the model is biased towards generating output that is aligned with the intent instead of purely modeling the most likely next possible sequence of tokens. Can we model theory-of-mind using reinforcement learning? Last but not least, we are interested in whether the ToM-inspired reward function we design in Section 5.3.2 can help train better 78 Fluency Groundedness Guidance Matching Star DM 0 20 40 60 80 100 Implicit Intent Expliict Intent Intent+ToM Figure 5.8: Summary of performance on different evaluation aspects from the BEST 1) implicit intent model (IDM-Label 41k, 2) explicit intent model (Mined Intent), and 3) intent with ToM-inspired RL (ToMRL Generated Intent). communication models for G4C. Figure 5.7 shows the results of adding ToM to mined intent (left) and generated intent (right) models. 
We find that despite using a much smaller (1/4 parameter) base model, models with reward function mimicking ToM can outperform the no-ToM variants on generating 40% more outputs with guidance that lead to players to perform the action matching intents while performing on par on groundedness. We also find that the fluency drops, possibly due to using a smaller base LM (due to memory constraints) and RL training affects the naturalness of outputs. Potential remedies we plan to explore in the future include using larger models and modifying the reward function to also account for fluency such as using KL divergence. Even with the drop in fluency, however, we still observe that with ToM-inspired RL, models can generate responses that satisfy all measures (star DM) up to 3.5 times more than without ToM modeling. Finally, we present an overall comparison between the best models under each category (implicit intent, explicit intent, explicit intent with ToM modeling) in Figure 5.8. All three variants perform on par 79 with groundedness. And while fluency drops when adding explicit intents and ToM, these two additions improve dramatically on the goal-driven aspects (guidance and action matching). Models with both explicit intents and ToM modeling using RL perform overall the best and produce almost threefolds of human DM-like (star) responses than others. This shows a promising sign that both intents and ToM-inspired RL can help goal-driven models to better achieve their communicative intents. 5.6 Conclusion We propose G4C to study goal-driven and grounded language interactions focusing on generating guidance from the teacher to lead students to perform certain actions. We use D&D as our test bed and construct large-scale data G-Dragon by using IDM to provide quality labels. We train models to generate guidance by modeling intents and theory-of-mind. Results show a promising sign that incorporating explicit intents and ToM modeling makes better communication agents. 5.7 Ethics and Broader Impact Our study is conducted in English, which benefits English speakers more. D&D is also more popular in the western world. We use Amazon Mechanical Turk to recruit crowdsourcing workers and we pay workers over $15/hour on average, well above the highest state minimum wage, and engage in constructive discussions if they have concerns about the process. We also give each annotation instance enough time so that we do not pressure annotators. The online forum D&D gameplay data we use from [21] might contain aggressive language. Our intents are mined from LLM (GPT-3), which might surface or even amplify harmful content within these models, such as biases and private information. We use a keyword-based filter for both the dialogue and intent data before training our models. 80 Our work deals with communicative intents of neural computational models. However, we want to emphasize that the intents of AI models (especially conversational systems) should be closely monitored and regulated [29]. In our work, we choose a fantasy domain with a relatively low stake to study model intentions with the overall goal of assisting players (humans or AI) to have a better experience in a roleplaying game. 5.8 Limitations Here we discuss several limitations of our work and point to potential future work directions. First, we focus on single teacher and single student setup to study guidance generation whereas in real life there often are multiple teachers and students. 
We plan to extend to multi-party goal-driven communication and D&D also provides a proper testbed to study this problem. Second, there are more nuances in guidance: railroading direct guidance (“make a persuasion check”) and subtle indirect guidance (“the guards seem to be a bit shaken”). We did include them in our human labeling and evaluation interface but did not specifically distinguish them during modeling. Third, due to the constraints on input sizes for most LMs, we have to set a context window to study dialogue generation in D&D. However, both DM and players have a long-term memory about the comprehensive story progression which might influence how they communicate. As a next step, we plan to use summarization models and adventure books as narrative backgrounds to ground our G4C task with a larger world setting. We include answers to other Frequently Asked Questions (FAQ) here. 81 5.8.1 Why only training a DM model to generate guidance instead of everything a DM says? A DM needs to do multiple complex language tasks (see [21] for more analysis) such as world modeling, storytelling, role playing with a persona, judging rules, etc. And we argue that these span multiple papers or even thesis. Instead of conflating all kinds of language tasks DM is performing, we focus on the goal-driven aspect of DM: generating guidacne for players to proceed the story. This task is both critical since human language usage always comes with a purpose [2] and challenging as even LLMs such as ChatGPT [93] often lack the ability to produce an utterance that fulfills a communicative intent. We also argue that with the key capability of generating guidance fulfilling intents, the model can be combined with models with different focus such as storytelling, describing world state, etc. to mimic a human DM. 5.8.2 How generalizable is a DM model on other domains? D&D is a specific domain we choose to study G4C due to its grounded and goal-driven nature. We admit it is non-trivial to directly apply a DM model on other domains. However, we believe that the insights from our modeling approaches attempting to incorporate intents and ToM can generalize to other domains. Specifically, explicitly including intents in context and using RL to model ToM by anticipating others’ reactions can be easily applied in other scenarios. For example, we can generate intents for a open-domain chatbot such as expressing empathy toward users or make suggestions on an issue the user is facing and using ToM modeling to better generate utterances that achieve those purposes. 82 Figure 5.9: Inference collection collecting guidance labels. 83 Figure 5.10: Inference collection collecting guidance labels. 84 Figure 5.11: Evaluation interface. 85 Chapter 6 Eliciting Planning in Large Language Models by Self-Discovering Reasoning Structures 6.1 Introduction Large Language Models (LLM) [17, 24, 95, 6] powered by transformers [146] have produced impressive breakthroughs in generating coherent texts [93], and following instructions [178, 86, 152, 25, 96]. In pursuit of the goal to enhance LLMs’ capability to reason and solve complex problems, various prompting methods have been proposed, drawing inspirations from cognitive theories of how humans reason. 
For example, few-shot and zero-shot chain-of-thought (CoT) [92, 153, 67, 166] resembles how humans solve problems step-by-step, decomposition-based prompting [179, 36, 100, 55, 65] is inspired by how humans breakdown a complex problem into a series of smaller subproblems, and then solve those subproblems one by one [102], and step-back prompting [177] is motivated by how humans reflect on task nature to derive general principles. However, a fundamental limitation is that each technique itself serves as an atomic reasoning module making an implicit prior assumption of the process on how to tackle a given task. Instead, we argue that each task has a unique intrinsic structure underlying the reasoning process involved 86 in solving it efficiently. For instance, least-to-most prompting [179, 36] has shown to be much more effective than CoT [153] at solving tasks such as symbolic manipulation and compositional generalization, due to the decomposition structure of the tasks. Task Chain-of-Thought (CoT) Direct Answer Answer Self-Discover Reasoning Structures (Ours) Task Answer Task Task-Specific Reasoning Structure Rationale Answer Structured Reasoning Avg. BBH: +11% T4D: + 39% MATH: +5.5% Avg. BBH: +7% T4D: + 29% MATH: +8.5% Figure 6.1: Self-Discover guides LLMs to self-discover and compose atomic reasoning modules into a reasoning structure to solve challenging tasks. Through testing on challenging reasoning benchmarks including Big Bench-Hard (BBH), agent reasoning (T4D), and MATH, we find that Self-Discover outperforms Direct Answering on 23/25 and CoT on 21/25 tasks in zero-shot setting using PaLM 2-L. This paper aims at self-discovering the underlying reasoning structure unique to each task, while being highly efficient in terms of computation. Our approach, Self-Discover, is inspired by how humans internally devise a reasoning program for problem-solving [90, 111], as illustrated in Figure 6.2 . From a set of atomic reasoning modules described in natural language such as “breakdown into sub tasks” and “critical thinking”, an LLM, and task examples without labels, Self-Discover composes a coherent reasoning structure intrinsic to the task (Stage 1) and then solves instances of the task using the discovered structure (Stage 2). Stage 1 operates at the task-level and uses three actions to guide the LLM to generate a reasoning structure for the task. At Stage 2, during the final decoding, the LLM simply follows the self-discovered structure to arrive at the final answer. 87 Solving problems using Self-Discover brings several benefits compared to other methods for LLM reasoning. First, the discovered reasoning structure is grounded in atomic reasoning modules benefiting from the strengths of multiple reasoning modules in contrast to applying a priori module such as CoT. Second, Self-Discover is efficient in computation as it only requires 3 more inference steps on the task-level, while being more performant than inference-heavy ensemble approaches such as self-consistency [151]. Lastly, the discovered reasoning structure is intrinsic to the task, and conveys LLMs’ insights about the task in a more interpretable way than the optimized prompts [188, 162]. We test Self-Discover on 25 challenging reasoning tasks including Big Bench-Hard (BBH) [136], Thinking for Doing (T4D) [185] and MATH [57]. 
Self-Discover outperforms CoT on 21/25 task with performance gains up to 42% (Figure 6.1), highlighting the advantage of the self-discovered reasoning structure composed from the atomic reasoning modules against a single a priori CoT module. Furthermore, we demonstrate that Self-Discover achieves superior performance against inference-heavy methods such as CoT + Self-Consistency and majority voting of every module while requiring 10-40x fewer inference compute (Figure 6.6). Finally, we compare Self-Discover with prompts optimized (OPRO) using a training set [162] (Figure 6.10). We find that Self-Discover still performs on par or better than OPRO while the self-discovered reasoning structure are much more interpretable. We conduct a set of analysis to understand the effectiveness of Self-Discover. By breaking down BBH tasks into 4 different categories, we find that Self-Discover performs best on tasks requiring world knowledge and has a moderate performance boost on algorithmic tasks compared to CoT (Figure 6.5). This is further confirmed by the error analysis on MATH, where 74.7% model failures comes from computation errors (e.g. math). We also take a closer look at the self-discovered reasoning structures, and show the universality of them by transferability study from PaLM 2-L to GPT-4, and from GPT-4 to Llama-2-70B. We hope to encourage more future work on structured reasoning for solving challenging problems using LLMs. 88 6.2 Self-Discovering Reasoning Structures for Problem Solving Language Model Atomic Reasoning Modules Reasoning Structure { "Type and color of each item": "" "Number of items of each color": "" "Number of items of each type": "" "Number of items of each color and type": "Final answer": } Task: Reasoning colored objects Self-Discover Language Model Answer Stage 1: Discover Reasoning Structure on Task-Level Stage 2: Solve Problems Using Discovered Structure on Instance-Level Task Instance Reasoning Structure Key-Value pairs Fill in the Values based on Keys during decoding Figure 6.2: Illustration of using Self-Discover for problem-solving. Given a generative LM, task, and seed reasoning module descriptions, we guide LMs to generate a reasoning structure in key-value format to solve the task. Finally, models can follow the self-discovered structures to solve the every instance from the task by filling in the values in JSON step-by-step. We take inspiration from how humans use prior knowledge and skills to devise a reasoning program to solve problems [90, 111]. When we face a new problem, we often first search internally what knowledge and skills from our prior experience might be helpful to solve it. Then we will attempt to apply relevant knowledge and skills to this task. And finally we will connect multiple individual skills and knowledge to solve the problem. We design Self-Discover to enact these steps into two stages as illustrated in Figure 6.2. Given a task and a set of reasoning module descriptions representing high-level problem-solving heuristics such as “Use critical thinking” and “Let’s think step by step”, Stage 1 of Self-Discover aims to uncover the intrinsic reasoning structure for solving this task via meta-reasoning. Specifically, we uses three meta-prompts to guide LLMs to select, adapt, and implement an actionable reasoning structure with no labels or training required. We format the structure in key-value pairs similar to JSON due to interpretability and findings on following JSON boosts reasoning and generation quality [185, 94]. 
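For instance, for the colored-objects task illustrated in Figure 6.2, a discovered reasoning structure might look like the following (keys taken from that illustration, shown here as a Python dict for concreteness):

```python
# A self-discovered reasoning structure for the "reasoning about colored objects" task.
# Stage 1 produces only the keys (an empty plan); Stage 2 fills in the values per instance.
reasoning_structure = {
    "Type and color of each item": "",
    "Number of items of each color": "",
    "Number of items of each type": "",
    "Number of items of each color and type": "",
    "Final answer": "",
}
```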
Stage 1 operates on task-level, meaning we only need to run Self-Discover once for each task. Then, in Stage 2, we can simply use the discovered reasoning structure to solve every instance of the given task by instructing models to follow the provided structure by filling each key and arrive at a final answer. 89 6.2.1 Stage 1: Self-Discover Task-Specific Structures The first stage consists of three actions: 1) SELECT, where relevant reasoning modules for task-solving are chosen from the set of reasoning module descriptions; 2) ADAPT, where descriptions of selected reasoning modules are rephrased to be more specific to the task at hand; and 3) IMPLEMENT, where the adapted reasoning descriptions are implemented into a structured actionable plan so that the task can be solved by following the structure. Self-Discover SELECT ADAPT IMPLEMENT ❖ Step-by-step ❖ Break down ❖ Propose-verify All Seed Modules ❖ Step-by-step ❖ Break down ❖ … Selected Modules ❖ Step-by-step analyze each item ❖ Break down to type and color … Adapted Modules { "Type and color of each item": "Number of items of each color": "Number of items of each type": ...} Reasoning Structure Language Model Selected Modules Language Model Adapted Modules Language Model Figure 6.3: Illustration of three actions of Self-Discover. We use LMs to compose a coherent reasoning structure by selecting relevant modules, adapting to task-specific descriptions, and implement a reasoning structure in JSON. SELECT First, not every reasoning module is helpful for every task, so the first stage of Self-Discover guides model to select modules that are useful based on task examples. For example, “reflective thinking” might help search for first-principle theories on science problems, while “creative thinking” helps on generating a novel continuation to a story. Given raw set of reasoning module descriptions D such as “critical thinking”, and “break the problem into sub-problems”, and a few task examples without labels ti ∈ T, SelfDiscover first selects a subset of reasoning modules DS that are useful for solving the tasks by using a model M and a meta-prompt pS: DS = M(pS ∥ D ∥ ti). (6.1) 90 ADAPT Since each reasoning module provides a general description of how to solve problems, the next step of Self-Discover aims at tailoring each selected module to the task at hand. For example, from “break the problem into sub-problems” to “calculate each arithmetic operation in order” for arithmetic problems. Given selected reasoning module subset DS from the previous step, ADAPT rephrases each of the selected module to be more specific to the task. Similarly to SELECT, this stage uses a meta-prompt pA and a generative model M to generate the adapted reasoning module descriptions DA: DA = M(pA ∥ DS ∥ ti). (6.2) IMPLEMENT Finally, given the adapted reasoning module descriptions DA, Self-Discover operationalizes the reasoning modules into an implemented reasoning structure DI with specified instruction on what to generate for each step. In addition to a meta prompt pI , IMPLEMENT also provides a demonstration of a human-written reasoning structure Shuman on another task to better convert the natural language descriptions into a reasoning structure: DI = M(pI ∥ Shuman ∥ DA ∥ ti). (6.3) 6.2.2 Stage 2: Tackle Tasks Using Discovered Structures After the three stages, we have an implemented reasoning structure DI uniquely adapted for the task we need to solve T. 
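Putting the three actions together, a minimal sketch of Stage 1 as three sequential LLM calls mirroring Equations 6.1-6.3 might look as follows; `llm` stands in for any generative model M, and the meta-prompt strings are abbreviated placeholders rather than the exact prompts used:

```python
def self_discover_stage1(llm, seed_modules, task_examples, human_structure_demo):
    """Compose a task-level reasoning structure via SELECT, ADAPT, IMPLEMENT (sketch)."""
    examples = "\n".join(task_examples)  # a few unlabeled task examples t_i

    # SELECT: choose the subset of seed reasoning modules useful for this task (Eq. 6.1).
    selected = llm("Select reasoning modules crucial for solving this task:\n"
                   + "\n".join(seed_modules) + "\n" + examples)

    # ADAPT: rephrase each selected module to be specific to the task at hand (Eq. 6.2).
    adapted = llm("Adapt each selected module description to this task:\n"
                  + selected + "\n" + examples)

    # IMPLEMENT: turn the adapted descriptions into a key-value reasoning structure (Eq. 6.3),
    # using a human-written structure from another task as a formatting demonstration.
    structure = llm("Implement an actionable step-by-step reasoning structure in JSON:\n"
                    + human_structure_demo + "\n" + adapted + "\n" + examples)
    return structure
```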
Then we can simply append the reasoning structure to all instances of the task and prompt models to follow the reasoning structure to generate an answer A: A = M(DI ∥ t), ∀t ∈ T. (6.4)

6.2.3 Prompt Details

Table 6.1 shows all 39 reasoning modules we use for Self-Discover, adopted from [39], that contain cognitive heuristics of problem-solving. Figure 6.4 contains the structure of the three actions of Self-Discover during Stage 1, where it discovers an intrinsic reasoning structure on the task-level. For Stage 2, where we use the self-discovered structure to solve the task instances, we start with the prompt: “Follow the step-by-step reasoning plan in JSON to correctly solve the task. Fill in the values following the keys by reasoning specifically about the task given. Do not simply rephrase the keys.”, followed by the reasoning structure, and finally the task instance.

Figure 6.4: Meta-Prompts for the three actions of Self-Discover. Each meta-prompt consists of an instruction at the beginning and the end, reasoning module descriptions, and task examples without labels. For IMPLEMENT, to show the model an example of a reasoning structure (plan), we present a human-written structure in JSON for another task.

Table 6.1: All 39 reasoning modules consisting of high-level cognitive heuristics for problem-solving. We adopt them from [39]. Reasoning Modules 1 How could I devise an experiment to help solve that problem? 2 Make a list of ideas for solving this problem, and apply them one by one to the problem to see if any progress can be made. 3 How could I measure progress on this problem? 4 How can I simplify the problem so that it is easier to solve? 5 What are the key assumptions underlying this problem? 6 What are the potential risks and drawbacks of each solution? 7 What are the alternative perspectives or viewpoints on this problem? 8 What are the long-term implications of this problem and its solutions? 9 How can I break down this problem into smaller, more manageable parts? 10 Critical Thinking: This style involves analyzing the problem from different perspectives, questioning assumptions, and evaluating the evidence or information available. It focuses on logical reasoning, evidence-based decision-making, and identifying potential biases or flaws in thinking. 11 Try creative thinking, generate innovative and out-of-the-box ideas to solve the problem. Explore unconventional solutions, thinking beyond traditional boundaries, and encouraging imagination and originality. 12 Seek input and collaboration from others to solve the problem. Emphasize teamwork, open communication, and leveraging the diverse perspectives and expertise of a group to come up with effective solutions.
13 Use systems thinking: Consider the problem as part of a larger system and understanding the interconnectedness of various elements. Focuses on identifying the underlying causes, feedback loops, and interdependencies that influence the problem, and developing holistic solutions that address the system as a whole. 14 Use Risk Analysis: Evaluate potential risks, uncertainties, and tradeoffs associated with different solutions or approaches to a problem. Emphasize assessing the potential consequences and likelihood of success or failure, and making informed decisions based on a balanced analysis of risks and benefits. 15 Use Reflective Thinking: Step back from the problem, take the time for introspection and self-reflection. Examine personal biases, assumptions, and mental models that may influence problem-solving, and being open to learning from past experiences to improve future approaches. 16 What is the core issue or problem that needs to be addressed? 17 What are the underlying causes or factors contributing to the problem? 18 Are there any potential solutions or strategies that have been tried before? If yes, what were the outcomes and lessons learned? 19 What are the potential obstacles or challenges that might arise in solving this problem? 20 Are there any relevant data or information that can provide insights into the problem? If yes, what data sources are available, and how can they be analyzed? 21 Are there any stakeholders or individuals who are directly affected by the problem? What are their perspectives and needs? 22 What resources (financial, human, technological, etc.) are needed to tackle the problem effectively? 23 How can progress or success in solving the problem be measured or evaluated? 24 What indicators or metrics can be used? 25 Is the problem a technical or practical one that requires a specific expertise or skill set? Or is it more of a conceptual or theoretical problem? 26 Does the problem involve a physical constraint, such as limited resources, infrastructure, or space? 27 Is the problem related to human behavior, such as a social, cultural, or psychological issue? 28 Does the problem involve decision-making or planning, where choices need to be made under uncertainty or with competing objectives? 29 Is the problem an analytical one that requires data analysis, modeling, or optimization techniques? 30 Is the problem a design challenge that requires creative solutions and innovation? 31 Does the problem require addressing systemic or structural issues rather than just individual instances? 32 Is the problem time-sensitive or urgent, requiring immediate attention and action? 33 What kinds of solution typically are produced for this kind of problem specification? 34 Given the problem specification and the current best solution, have a guess about other possible solutions. 35 Let’s imagine the current best solution is totally wrong, what other ways are there to think about the problem specification? 36 What is the best way to modify this current best solution, given what you know about these kinds of problem specification? 37 Ignoring the current best solution, create an entirely new solution to the problem. 38 Let’s think step by step. 39 Let’s make a step by step plan and implement it with good notion and explanation. 93 6.3 Experiment Setup 6.3.1 Tasks We focus on diverse reasoning benchmarks that are still challenging for LLMs: BIG-Bench Hard (BBH) [136] contains 23 carefully-selected challenging tasks from BIG-Bench [133]. 
BBH tasks cover a diverse range of reasoning problems spanning the following 4 categories according to their authors: 1) Algorithmic and Multi-Step Arithmetic Reasoning, 2) Natural Language Understanding, 3) Use of World Knowledge, and 4) Multilingual Knowledge and Reasoning. We also test on a grounded social agent reasoning task called Thinking for Doing (T4D) [185], where models must leverage mental-state reasoning to determine which actions to perform and where GPT-4 with CoT only reaches around 50% accuracy. Finally, we subsample 200 examples from the MATH [57] test set, and generate instance-level reasoning structures via a one-shot demonstration to adapt to the complexity of MATH tasks.

For evaluation, we use accuracy (exact matching of the final answer) to measure model performance on BBH, T4D and MATH, consistent with other methods tested on these benchmarks. To properly evaluate the generated answers from LLMs, we prompt the models to end the answer with "Thus, the final answer is [X]", where X is either one answer option such as "A" or a string such as "valid". During evaluation, we manually examine each task's outputs from LLMs and design heuristics to extract the final answers. For the MATH dataset, we find that it is challenging to extract the answers accurately. As a result, we subsample 200 test examples from MATH, and manually sanity-check and annotate the extracted answers for all methods tested in our paper.

6.3.2 Models

We use several state-of-the-art LLMs: GPT-4 (gpt-4-turbo-preview) [95], GPT-3.5-turbo (ChatGPT) [93]∗, instruction-tuned PaLM 2-L [6]†, and an open-source LLM Llama2-70B [140]. ∗ Accessed in October-December 2023. † For MATH, we use a PaLM 2-L model with a stronger instruction tuning to enable better instruction following of more complex reasoning structures.

6.3.3 Baselines

We compare Self-Discover with other zero-shot prompting methods for LLM reasoning:
• Direct Prompting, where the model directly generates the answer without intermediate reasoning steps.
• CoT [153, 67], where models are prompted to generate a reasoning process leading to the final answer.
• Plan-and-Solve [150], where models are prompted to first generate a plan and then solve the problem. Self-Discover differs by grounding the reasoning structure in atomic reasoning modules, and prompting the decoding to follow the explicit key-value reasoning structure.

Next, we also consider other baselines that make use of the raw seed reasoning modules (RM) we pass to Self-Discover. We compare with the following methods' performance and inference-call efficiency on a subset of tasks.
• CoT-Self-Consistency [151]: we sample multiple outputs from the LLM with CoT and aggregate the answers to get the final answer. We compare this method on a subset of tasks due to the cost of repetitive queries.
• Majority voting of each RM: we prompt models to solve the tasks by appending each RM and use majority voting of all answers to get the final answer. We examine whether integrating multiple RMs into a coherent reasoning structure is advantageous over applying each RM to solve the task and using majority voting to ensemble them post hoc, which costs much more inference computation.
• Best of each RM: this method assumes that we have access to oracle labels and uses the highest accuracy from applying each RM. We compare with this to examine whether Self-Discover competes with methods that depend on perfect prior knowledge of which RM to use on a new task.
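To make the aggregation step in these sampling-based baselines concrete, below is a minimal Python sketch of majority-vote ensembling. It is only an illustration of the baseline logic described above, not the exact implementation used in our experiments; query_model is a hypothetical callable that sends a prompt to an LLM and returns an extracted final-answer string.

from collections import Counter

def majority_vote(answers):
    # Return the most frequent extracted answer among the sampled outputs.
    return Counter(answers).most_common(1)[0][0]

def cot_self_consistency(query_model, cot_prompt, task_instance, n_samples=10):
    # CoT-Self-Consistency: sample the same CoT prompt multiple times and vote.
    answers = [query_model(cot_prompt + "\n" + task_instance) for _ in range(n_samples)]
    return majority_vote(answers)

def majority_vote_each_rm(query_model, reasoning_modules, task_instance):
    # Majority voting of each RM: solve the instance once per seed reasoning module and vote.
    answers = [query_model(rm + "\n" + task_instance) for rm in reasoning_modules]
    return majority_vote(answers)

Both baselines cost one model call per sample or per RM for every instance, which is the source of the 10x-40x inference overhead quantified in Section 6.4.3.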
Furthermore, for analysis of the universality of reasoning structures, we compare with a prompt-optimization method that requires a training set to improve prompts: LLMs as optimizers (OPRO) [162]. We aim to show that when we apply structures or prompts optimized with one model to another, the reasoning structures can retain more of the performance gains than the wordings of the prompts.

6.4 Results

We answer the following questions through experimental results: 1) Does discovering reasoning structures improve LLM reasoning capabilities? (6.4.1) 2) On which categories of problems does Self-Discover perform best? (6.4.2) and 3) Can Self-Discover boost LLM performance efficiently? (6.4.3) Finally, we show qualitative examples of self-discovered structures and of LLM output following the structures, and compare with LLM output following other prompting methods for reasoning (6.4.4).

6.4.1 Does Self-Discover Improve LLM Reasoning?

Overall, Self-Discover improves PaLM 2-L and GPT-4's reasoning across a diverse set of reasoning tasks. Table 6.2 shows the overall results on the complex reasoning tasks of BBH, T4D and MATH using PaLM 2-L and GPT-4. We compare Self-Discover with baselines including direct prompting, CoT, and Plan-and-Solve (PS).

Table 6.2: Self-Discover significantly improves LLM reasoning across a diverse set of 25 complex tasks: BBH, T4D and MATH. CoT: zero-shot Chain of Thought [67]. PS: plan-and-solve prompting [150].
Method | BBH | T4D | MATH
PaLM 2-L 56% 30% 45%
PaLM 2-L + CoT 60% 40% 42%
PaLM 2-L + PS 61% 42% 49%
PaLM 2-L + Self-Discover 67% 69% 50.5%
GPT-4 58% 51% 70.5%
GPT-4 + CoT 75% 52% 71%
GPT-4 + PS 73% 53% 70%
GPT-4 + Self-Discover 81% 85% 73%

On the aggregated 23 tasks of BBH, Self-Discover achieves 7% and 6% absolute improvement on PaLM 2-L over Chain-of-Thought and Plan-and-Solve, respectively. Similar gains (6% and 8%) are observed when Self-Discover is applied to GPT-4. Breakdown results of each task's improvement over direct answering and CoT for PaLM 2-L are shown in Figure 6.1, where we find that Self-Discover outperforms them on over 20/24 tasks. Per-task performance on BBH (23 tasks in total) is shown in Table 6.3.

On the grounded social agent task T4D, Self-Discover reaches ≥ 27% (32%) absolute improvement over all baselines on PaLM 2-L (GPT-4). Self-Discover achieves 69% and 85% accuracy on PaLM 2-L and GPT-4, significantly outperforming the previous SoTA prompting method Foresee and Reflect (FaR), which employs an expert-designed reasoning structure. In contrast, Self-Discover generates the reasoning structure automatically from a set of atomic reasoning modules without human intervention.

For MATH, we observe a moderate gain of 1%-7% (2%-3%) on PaLM 2-L (GPT-4) from Self-Discover compared to the baselines. Upon error analysis (see Section 6.5.3 for details), we find that the reasoning structures generated by PaLM 2-L from Self-Discover are correct 87.5% of the time: human experts can follow the reasoning structures to solve the tasks perfectly. The majority of the failures (74.7%) comes from errors in executing the computations, consistent with prior findings [177].

Table 6.3: Big Bench-Hard [136] per-task performance of GPT-4 and PaLM 2-L with Self-Discover.
Big Bench-Hard Task | Human (Avg.) | Human (Max) | GPT-4 Direct | GPT-4 + CoT | GPT-4 + Self-Discover | PaLM 2-L Direct | PaLM 2-L + CoT | PaLM 2-L + Self-Discover
boolean_expressions 79 100 73 83 85 71 84 84
causal_judgement 70 100 67 75 80 46 59 61
date_understanding 77 100 74 80 81 73 78 78
disambiguation_qa 67 93 60 70 80 54 50 57
dyck_languages 48 100 69 73 77 94 95 98
formal_fallacies 91 100 60 60 80 60 63 69
geometric_shapes 54 100 30 56 60 33 34 39
hyperbaton 75 100 68 69 76 80 75 82
logical_deduction_seven_objects 40 89 60 70 70 45 39 50
movie_recommendation 61 90 70 70 86 83 54 66
multistep_arithmetic_two 10 25 10 92 70 4 50 47
navigate 82 100 70 90 90 38 63 67
object_counting 86 100 90 100 100 27 44 70
penguins_in_a_table 78 100 80 100 90 70 67 75
reasoning_about_colored_objects 75 100 77 80 79 36 79 75
ruin_names 78 100 90 80 97 79 58 90
salient_translation_error_detection 37 80 40 50 70 56 48 60
snarks 77 100 73 89 97 58 62 86
sports_understanding 71 100 54 61 90 44 47 89
temporal_sequences 91 100 96 99 100 99 97 99
tracking_shuffled_objects_seven_objects 65 100 24 80 68 22 58 36
web_of_lies 81 100 15 80 71 54 42 67
word_sorting 63 100 65 90 85 12 4 15

6.4.2 Which Types of Problems Does Self-Discover Help the Most?

Self-Discover performs best on tasks that require diverse world knowledge. Figure 6.5 presents the average improvement, in terms of accuracy delta, of Self-Discover over direct answering and CoT on the 4 categories of reasoning tasks we test. We adopt the categorization from [136]. We find that Self-Discover improves over these two baselines on all categories, but especially on tasks that require world knowledge such as sports understanding, movie recommendation, and ruin names. These tasks demand that models reason using factual and general commonsense knowledge. We interpret Self-Discover's advantage on these tasks as the strength of integrating multiple reasoning modules from various perspectives, whereas applying only CoT might miss key knowledge in the reasoning process. We observe that the gain on the Algorithmic category is moderate, consistent with the findings from Sec. 6.4.1 on MATH.

Figure 6.5: Breakdown of Self-Discover performance improvement on 4 categories on PaLM 2-L. Self-Discover performs the best on tasks requiring world knowledge.

6.4.3 How Efficient is Self-Discover?

Self-Discover achieves better performance while requiring 10-40x less inference compute compared to self-consistency or majority voting. Here we examine a subset of 2 tasks from BBH and present a more thorough comparison of methods, including those requiring many inference calls that are too costly to run on all 24 tasks. Figure 6.6 shows average accuracy and the number of inference calls required per instance for each method using GPT-4. Accuracy-wise (y-axis), we find that Self-Discover outperforms other baselines, even those that require repeated inference calls such as CoT-self-consistency and majority voting of applying each RM. Efficiency-wise (x-axis), Self-Discover requires only one call per instance plus three more inference calls on the task level, CoT-self-consistency requires 10 times more since we have to sample 10 times for each instance, and methods applying each RM require 40 times more as we use 40 RMs.
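As a back-of-the-envelope illustration of these call counts, the short sketch below tallies the total number of model calls for N instances of a single task under each method. The per-method counts come from the description above; the code itself is only an assumed way of book-keeping the cost, not part of the experimental pipeline.

def total_calls(n_instances, calls_per_instance, task_level_calls=0):
    # Total model calls = one-time task-level overhead + per-instance calls.
    return task_level_calls + n_instances * calls_per_instance

n = 100  # e.g., 100 instances of one BBH task
print(total_calls(n, 1, task_level_calls=3))  # Self-Discover: 3 Stage 1 meta-prompts + 1 call per instance = 103
print(total_calls(n, 1))                      # Direct or CoT: 100
print(total_calls(n, 10))                     # CoT-Self-Consistency with 10 samples: 1000
print(total_calls(n, 40))                     # Majority voting over the ~40 seed RMs: 4000

Because the Stage 1 cost is paid once per task, Self-Discover's per-instance overhead approaches that of Direct and CoT as the number of instances grows.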
In summary, Self-Discover presents itself as a strong reasoning-boosting method that is efficient to deploy at scale.

Figure 6.6: Comparison of accuracy with the number of inference calls required per instance on BBH-Movie Recommendation and BBH-Geometric Shapes. For CoT-Self-Consistency, we sample 10 times. The Best of each RM method requires gold labels (*). Self-Discover requires only 1 inference call per instance (plus 3 more meta-prompts on the task level), the same as Direct and CoT, while reaching better performance compared with methods requiring 40x more calls (majority voting of each RM) on GPT-4.

We acknowledge that Self-Discover inputs and outputs are longer than those of CoT and Direct prompting, increasing cost. However, as the number of instances increases, the efficiency of Self-Discover in terms of inference calls per instance is highly desirable.

6.4.4 Qualitative Examples

We show examples of model-discovered structures for different reasoning tasks in Figure 6.7, generated by PaLM 2-L. We observe that each structure is uniquely adapted to the task, integrates multiple reasoning modules, and provides insights on how to solve the tasks. Furthermore, an example comparing the reasoning processes from CoT, Plan-and-Solve, and Self-Discover is shown in Figure 6.8. We find that CoT and Plan-and-Solve make incorrect assertions early and arrive at a wrong answer, while following the structure from Self-Discover leads the model to generate logical conclusions ("path is closed as the beginning and ending coordinates are the same") and arrive at the correct answer.

6.4.5 Additional Experiments

We further include Tree-of-Thought [163] and Graph-of-Thought [12] (zero-shot versions) as baselines for comparison, shown in Table 6.4.

reasoning_about_colored_objects { "Type and color of each item": "Number of items of each color": "Number of items of each type": "Number of items of each color and type": "Final answer": }
causal_judgement { "Identify the chain of events in the story": "Identify the consequences of each event": "Identify the cause-and-effect relationships between events": "Choose a final answer based on the reasoning": }
dyck_languages { "Parentheses that are not closed properly": "Stack to store the closing parentheses": "If the next symbol is a closing parenthesis, pop the stack and check if the popped symbol matches the next symbol": "If the stack is empty, add the next symbol to the stack": }
Devise an algorithm / Reflect on task nature / Break down to sub-tasks

Figure 6.7: Examples of self-discovered structures on BBH tasks using PaLM 2-L. We observe traits of atomic reasoning modules such as "step-by-step thinking" and "reflect on task nature", and an interesting creative-thinking case where the model devises an algorithm using a stack to solve the parenthesis parsing task.

To show the effectiveness of Self-Discover on more general tasks, we tested it on a subset of MMLU (10 subtasks, with 50 diverse questions each, all randomly sampled); results are shown in Table 6.5. We find that GPT-4+Self-Discover beats GPT-4+CoT in zero-shot on 7 out of 10 tasks, ties on 2 out of 10, and loses on 1 out of 10. In addition, we tried Self-Discover at the instance level, where, for each question, we run Stage 1 to output the reasoning structure and then solve the task.
We find that instance-level Self-Discover performs even better on MMLU, outperforming CoT by 7.2% on average across all tasks. This result, combined with those in the main content, shows that the strength of Self-Discover spans two types of tasks: for well-defined hard reasoning tasks such as BBH, task-level Self-Discover works well while being very efficient; for very open-domain tasks such as MMLU, we can apply instance-level Self-Discover, which significantly outperforms CoT while still requiring fewer inference calls than self-consistency.

Table 6.4: Additional baselines including Tree-of-Thought (ToT) and Graph-of-Thought (GoT).
Method | BBH | T4D | MATH
PaLM 2-L 56% 30% 45%
PaLM 2-L + CoT 60% 40% 42%
PaLM 2-L + ToT 58% 41% 44.5%
PaLM 2-L + GoT 60% 40% 40%
PaLM 2-L + PS 61% 42% 49%
PaLM 2-L + Self-Discover 67% 69% 50.5%
GPT-4 58% 51% 70.5%
GPT-4 + CoT 75% 52% 71%
GPT-4 + ToT 76% 50% 69%
GPT-4 + GoT 75% 52% 70%
GPT-4 + PS 73% 53% 70%
GPT-4 + Self-Discover 81% 85% 73%

Table 6.5: MMLU [136] per-task performance of GPT-4 and PaLM 2-L with Self-Discover. We sampled 10 tasks with 50 examples each. SD (instance) means that we run Stage 1 on each question and use the generated structure during solving, to account for the diversity of questions.
MMLU Tasks | GPT-4 Direct | GPT-4 + CoT | GPT-4 + SD | GPT-4 + SD (instance) | PaLM 2-L Direct | PaLM 2-L + CoT | PaLM 2-L + SD | PaLM 2-L + SD (instance)
business_ethics 78 83 85 91 72 77 80 83
high_school_world_history 64 69 74 83 54 59 61 66
machine_learning 72 80 81 88 70 75 75 78
college_medicine 45 52 50 54 44 45 45 49
high_school_statistics 68 75 75 84 60 66 68 73
international_law 70 77 77 82 60 69 63 71
conceptual_physics 62 66 70 74 59 64 65 69
marketing 71 75 76 82 67 69 71 74
jurisprudence 60 70 74 76 55 60 64 69
moral_disputes 62 68 69 73 60 65 66 68

This SVG path element draws a: (A) circle (B) heptagon (C) hexagon (D) kite (E) line (F) octagon (G) pentagon (H) rectangle (I) sector (J) triangle

Let's think step-by-step (CoT): This SVG path consists of multiple lines drawn between different points. These points do not form a regular geometric shape such as a circle, pentagon, heptagon, hexagon, or octagon. … They do, however, seem to form a shape with four sides, which could be considered a form of a kite… Thus the final answer is (D) kite.

Let's devise a plan and solve the problem (Plan-and-Solve): This SVG Path element consists of many "L" commands which draw a straight line from the current point to the new point. But it does not close the path, meaning the shape is not a regular polygon like a hexagon, heptagon, octagon or pentagon. The instructions make 4 lines in total but they are not straight and do not form a parallelogram or a rectangle. … Thus the final answer is (D) kite.

Self-Discover Reasoning Structure (Ours): { "Simplify SVG Path": ... "Breakdown of Path Commands": { "Move to Command (M)": "Sets the starting point for the next command without drawing anything.", "Line to Command (L) steps": {"Start and end coordinates of each line segment": "M 55.57,80.69 L 57.38,65.80: From point (55.57, 80.69) to (57.38, 65.80)"}, … and finally closing the shape at (55.57, 80.69)"} "Critical Thinking Analysis": { "Logical Reasoning": { "Analysis of path continuity": "The path includes multiple line segments that connect distinct points.
The path ends by connecting back to the starting point, indicating a closed shape.", "Identification of closed shapes": "The final line segment connects the last point back to the first point, which is characteristic of a closed shape."}, … "Final Reasoning and Decision": "With 7 distinct points all connected in a closed path, the shape formed is a heptagon.", "Final Answer": "B) heptagon"}

Figure 6.8: Comparison of the generated reasoning processes from CoT, Plan-and-Solve, and Self-Discover on the BBH-geometric shapes task. Both CoT and Plan-and-Solve incorrectly assert that the path does not form a regular shape because it is not a closed path (highlighted in red) and arrive at a wrong answer. The reasoning structure (in blue Courier font) from Self-Discover first breaks down each line segment and analyzes the coordinates carefully, then leverages logical reasoning to conclude that it forms a closed shape as the path ends at the same coordinate (highlighted in purple and orange), and selects the correct answer through final reasoning.

6.5 Deep Diving Into Self-Discovered Reasoning Structures

After experimental results showing the effectiveness and efficiency of Self-Discover on a range of reasoning tasks, this section analyzes two further questions: are all actions of Self-Discover needed, and what other benefits can self-discovered structures bring? In Sec. 6.5.1, we show that it is critical to the model's performance to use the reasoning structures discovered through the three steps of SELECT, ADAPT and IMPLEMENT. In Sec. 6.5.2, we demonstrate the universality of the self-discovered reasoning structures by (1) applying the structures discovered by PaLM 2-L to GPT-4 and (2) applying the structures discovered by GPT-4 to Llama-2-70B. We further show the commonalities between the reasoning structures and human reasoning patterns in Section 6.5.4.

Figure 6.9: Ablation study on the three Self-Discover actions on 4 reasoning tasks (Snarks, Movie Recommendation, T4D, Geometry): all three actions are beneficial for task-solving.

6.5.1 Importance of Self-Discover Actions

We conduct an ablation study on the three actions, SELECT, ADAPT, and IMPLEMENT, to analyze their effects. Figure 6.9 shows results using GPT-4 on 4 reasoning tasks when we apply only SELECT (-S), apply SELECT and ADAPT (-SA), or apply all three actions (-SAI). We find that with each stage, the model's zero-shot reasoning capability improves consistently across tasks, indicating that all three actions are beneficial. In particular, after all three actions (SAI), the reasoning structures are adapted to be task-specific and bring the most gain to solving the reasoning tasks.

6.5.2 Towards Universality of Discovered Reasoning Structures

Applying PaLM 2-L Discovered Structures to GPT-4. We first use a PaLM 2-L model to discover the reasoning structures of 4 reasoning tasks. Then, we apply the resulting reasoning structures to the decoding of GPT-4 as grounding. We compare our approach to OPRO [162], which discovers zero-shot prompts through optimization. We apply OPRO prompts optimized using PaLM 2-L on each task to GPT-4 on the same reasoning tasks.
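To make this transfer setup concrete, here is a minimal Python sketch, offered as a simplified illustration under stated assumptions rather than the actual experimental code: palm2_l and gpt4 are assumed to be callables that send a prompt to the corresponding model and return its text output, the Stage 1 prompt strings are abbreviated stand-ins for the meta-prompts of Figure 6.4, and the Stage 2 instruction is quoted from Section 6.2.3.

def discover_structure(model, task_examples, modules):
    # Stage 1 of Self-Discover, run once per task: SELECT, ADAPT, then IMPLEMENT.
    examples = "\n".join(task_examples)
    selected = model("Select several reasoning modules that are crucial to solve the task.\n"
                     + "\n".join(modules) + "\nTask examples:\n" + examples)
    adapted = model("Adapt each selected module description to better solve the task.\n"
                    + selected + "\nTask examples:\n" + examples)
    return model("Implement the adapted modules as a step-by-step reasoning structure in JSON.\n"
                 + adapted + "\nTask examples:\n" + examples)

def solve_with_structure(model, structure, task_instance):
    # Stage 2: any model can follow the discovered structure at decoding time.
    return model("Follow the step-by-step reasoning plan in JSON to correctly solve the task. "
                 "Fill in the values following the keys by reasoning specifically about the task given. "
                 "Do not simply rephrase the keys.\n" + structure + "\n" + task_instance)

# Cross-model transfer: discover with PaLM 2-L, then solve with GPT-4 (or Llama2-70B).
# structure = discover_structure(palm2_l, unlabeled_task_examples, reasoning_modules)
# answer = solve_with_structure(gpt4, structure, task_instance)

The key point is that the reasoning structure, not the model that produced it, is what is reused across models.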
Figure 6.10 shows that Self-Discover outperforms OPRO on 3 out of 4 tasks, despite the fact that OPRO used 20% of the data to optimize the prompt. In contrast, Self-Discover is done in a zero-shot manner, demonstrating the efficiency of our method and the universality of the discovered reasoning structures.

Figure 6.10: Transferability tests of optimized prompts (OPRO) and composed structures (Self-Discover). The results shown are from GPT-4 using the prompts and structures optimized or composed using PaLM 2-L. We find that the self-discovered reasoning structures transfer more robustly than the optimized prompts.

Applying GPT-4 Discovered Structures to Llama2 and ChatGPT. Motivated by the transferability across LLMs, we further investigate whether self-discovered reasoning structures from LLMs can boost reasoning for smaller LMs that struggle to come up with structures themselves.‡ We use GPT-4 to discover the task-intrinsic reasoning structures, and then apply those structures to the decoding of the open-sourced Llama2-70B as well as GPT-3.5-turbo (ChatGPT) on two subsets of tasks from BBH. We find that using self-discovered structures on Llama2 (52%) outperforms CoT (42%) on disambiguation QA zero-shot, and on GPT-3.5-turbo (56%) outperforms CoT (51%) on geometry with a 3-shot demonstration from the structured reasoning process. ‡ We tried zero-shot meta-prompting Llama2 but observed low-quality structure outputs.

6.5.3 Error Analysis

We perform an error analysis of Self-Discover on the MATH dataset with 200 samples to understand the failure modes. We manually annotate whether the generated reasoning structure is correct, together with the correctness of the model prediction using Self-Discover. A reasoning structure is defined as correct if a human expert can solve the task by simply following the reasoning structure. Out of 200 examples, we find that 87.5% (175) have correct reasoning structures, while 12.5% (25) have incorrect reasoning structures leading to prediction errors. Table 6.6 shows 4 such examples, where the LLM misunderstands the task, makes an error in one of the steps, or adds unnecessary steps in the reasoning structure.

Next, we analyze the errors made by the model with Self-Discover: out of 99 examples where the model prediction is wrong, wrong reasoning structures account for only 25.3% of the errors. The remaining 74.7% of errors are due to errors in the intermediate calculations such as math computations. Table 6.7 shows 3 examples of such errors. This insight indicates that future improvements should aim at improving the step-wise calculation accuracy of LLMs, such as by using tools or code generation.

6.5.4 Further Analysis

Model-Discovered Reasoning Structures vs. Human Reasoning Patterns. We investigate whether LLM-discovered reasoning structures share some commonalities with human reasoning patterns. We give humans 3 task instances without labels and an example reasoning structure (the same as in the Self-Discover meta-reasoning stage) and ask them to write a reasoning structure for a task before solving it. Figure 6.11 shows a comparison of human- and LLM-composed reasoning structures on the BBH-navigation task. We observe similar structures, such as taking step-wise mental notes after each movement.

Table 6.6: Examples of wrong reasoning structures for MATH. The first error in the reasoning structure is highlighted in red.
Prompt Reasoning Structure Error How many numbers between 1 and 2005 are integer multiples of 3 or 4 but not 12? 1. Find the number of multiples of 3 between 1 and 2005. 2. Find the number of multiples of 4 between 1 and 2005. 3. Find the number of multiples of 12 between 1 and 2005. 4. Add the number of multiples of 3 and 4. 5. Subtract the number of multiples of 12 (to avoid double counting). 6. The result is the number of integers between 1 and 2005 that are multiples of 3 or 4 but not 12. Need to subtract the number of multiples of 12 twice instead of once. How many numbers are in the list 6,7,10,11,14,15,..., 94,95,98? 1. Find the number of pairs in the list. 2. Find the number of numbers in each pair. 3. Multiply the number of pairs by the number of numbers in each pair to find the total number of numbers in the list. LLM misunderstands the task: the reasoning structure is irrelevant. A restaurant offers three desserts, and exactly twice as many appetizers as main courses. A dinner consists of an appetizer, a main course, and a dessert. What is the least number of main courses that the restaurant should offer so that a customer could have a different dinner each night in the year 2003? 1. Define the variables: - M: number of main courses - A: number of appetizers (twice the number of main courses) - D: number of desserts (fixed at 3) - T: total number of possible dinner combinations 2. Calculate the total number of possible dinner combinations: - T = M * A * D 3. Set the total number of possible dinner combinations equal to the number of nights in the year 2003: - T = 2003 4. Substitute the expression for A in terms of M: - T = M * (2 * M) * 3 5. Solve the equation for M: - 6M2 = 2003 6. Find the smallest integer value of M that satisfies the equation. There are 365 nights in the year 2003. Determine the number of ways to arrange the letters of the word NINE. 1. Count the total number of letters in the word NINE. 2. Identify the number of times each letter appears in the word. 3. Calculate the factorial of the total number of letters. 4. Divide the factorial by the factorial of the number of times each letter appears. 5. Multiply the results from step 4 for each unique letter. } Step 5 is unnecessary and leads to error in the final result. 107 Table 6.7: Examples of wrong calculations for MATH. The first error in the intermediate computations is highlighted in red. Prompt Self-Discover Prediction Error How many ways are there to arrange 6 people around a circular table with 7 seats? (Two seatings are considered the same if one is a rotation of the other.) "1. Calculate the total number of ways to arrange 6 people in 7 seats without considering rotations as the same.": 7 * 6 * 5 * 4 * 3 * 2 = 5040 - There are 7 choices for the first person. - There are 6 choices for the second person. - There are 5 choices for the third person. - There are 4 choices for the fourth person. - There are 3 choices for the fifth person. - There are 2 choices for the sixth person. "2. Divide the result from step 1 by the number of rotations that are considered the same.": 5040 / 6 = 840 - There are 6 rotations that are considered the same. There are 7 rotations instead of 6 that are considered the same. A customer ordered 15 pieces of gourmet chocolate. The order can be packaged in small boxes that contain 1, 2 or 4 pieces of chocolate. Any box that is used must be full. How many different combinations of boxes can be used for the customer’s 15 chocolate pieces? 
One such combination to be included is to use seven 2-piece boxes and one 1-piece box. "1. Find the number of ways to package the chocolates using only boxes of 1 piece.": 1, "2. Find the number of ways to package the chocolates using only boxes of 2 pieces.": 1, "3. Find the number of ways to package the chocolates using only boxes of 4 pieces.": 0, "4. Find the number of ways to package the chocolates using a combination of boxes of 1 and 2 pieces.": 7, "5. Find the number of ways to package the chocolates using a combination of boxes of 1 and 4 pieces.": 3, "6. Find the number of ways to package the chocolates using a combination of boxes of 2 and 4 pieces.": 2, "7. Find the number of ways to package the chocolates using a combination of boxes of 1, 2, and 4 pieces.": 4, "8. Sum up the total number of ways from the seven cases above.": 1 + 1 + 0 + 7 + 3 + 2 + 4 = 18 Error: there are 0 ways to package 15 chocolates using only boxes of 2 pieces.

How many primes are in the row of Pascal's Triangle that starts with a 1 followed by a 6? "1. Identify the row of Pascal's Triangle that starts with a 1 followed by a 6.": "1, 6, 15, 20, 15, 6, 1", "2. List all the numbers in that row.": [1, 6, 15, 20, 15, 6, 1], "3. Check each number in the list for primality.": [False, True, False, False, False, True, False], "4. Count the number of primes found in the list.": 2 Error: 6 is not a prime.

Human-Written Structure: { "Position after instruction 1": "Position after instruction 2": … "Position after instruction n": "Is final position the same as starting position?": }
Model-Discovered Structure: { "Break down instructions into individual movements": { "Instruction 1": "", "Effect on position after Instruction 1": "", "Instruction 2": "", "Effect on position after Instruction 2": "", … "Additional instructions if present": "" }, "Simplify the sequence of movements": { "Simplified representation of series": "" } … }
Shared Structure: Step-wise mental notes
Task-Navigation: If you follow these instructions, do you return to the starting point? Always face forward. Take 1 step backward. Take 9 steps left. Take 2 steps backward. Take 6 steps forward. Take 4 steps forward. Take 4 steps backward. Take 3 steps right.

Figure 6.11: Case study showing that a human-written structure shares commonalities with the LLM-discovered reasoning structure. We observe similar reasoning patterns: both structures contain a step-wise analysis of each instruction.

From the promising findings that LLM self-discovered structures boost performance and share traits with human meta-reasoning, we hope to encourage more future work studying human-AI collaboration for complex problem-solving.

6.6 Conclusion

We introduce Self-Discover, an efficient and performant framework for models to self-discover a reasoning structure for any task from a seed set of general problem-solving skills. We observe drastic improvements of up to 30% on challenging reasoning benchmarks across multiple LLMs. Ablation studies of Self-Discover demonstrate that all three actions are beneficial, and transfer experiments show that the composed reasoning structures are universally transferable between LLMs. Looking forward, we are excited to explore LLM structured reasoning further, to push the boundary of problem-solving and discover the potential of human-AI collaboration.
Chapter 7
Conclusions and Future Directions

This dissertation has explored the critical role of common ground reasoning in the development of communicative agents, aiming to bridge the gap between human and artificial intelligence in natural language communication. By enhancing conversational models with common sense knowledge [182, 183, 181], modeling theory-of-mind capabilities [187, 185], and eliciting metacognition through dynamic reasoning strategies [186], this work has made significant strides towards creating AI agents that can engage in rich, effective interactions with humans.

7.1 Impact and Contributions

The contributions of this thesis are multifaceted. Firstly, the integration of external knowledge bases and the development of novel methods for implicit knowledge representation have been shown to substantially improve an agent's understanding of conversational context and unstated information. This advancement is crucial for creating more intuitive and context-aware AI systems.

Secondly, by equipping agents with theory-of-mind capabilities, this research has demonstrated how AI agents can model the mental states of their conversational partners in a goal-driven scenario. This capability allows agents to reason about beliefs, desires, and intentions, leading to more effective and goal-oriented dialogues. This advancement in goal-driven dialogue models can significantly enhance human-agent collaboration, making AI systems more useful in practical applications.

Finally, the elicitation of metacognition by planning reasoning strategies for diverse scenarios has underscored the importance of adaptability in communicative agents. By demonstrating metacognitive abilities, agents can navigate complex communicative scenarios more effectively, ensuring that their interactions remain relevant and contextually appropriate.

7.2 Limitations

Despite these advancements, several limitations remain. The integration of external knowledge bases, while beneficial, can introduce challenges related to the accuracy and relevance of the knowledge retrieved. Ensuring that the knowledge bases are up-to-date and contextually appropriate is an ongoing challenge.

The modeling of theory-of-mind capabilities, while promising, is still in its nascent stages. Accurately inferring the mental states of conversational partners requires a deep understanding of human psychology, which current models only approximate. Further research is needed to refine these models and improve their accuracy, especially in more realistic and dynamic scenarios.

Additionally, the elicitation of metacognition presents its own set of challenges. Developing agents that can dynamically adapt their strategies based on context and goals requires sophisticated planning algorithms and a deep understanding of the scenarios they might encounter. This area remains an open field for further exploration and refinement.

7.3 Future Directions

The current state of developing capable communicative agents is fast-moving, with new challenges emerging. I believe that to build helpful and general AI agents, we need to shift from purely conversational models to generalist agents that can communicate with humans and AI using natural language and cooperate smoothly. The work presented in this dissertation lays the groundwork for several future research directions.

1. New modes of human-AI interaction will emerge. How can we design proper interaction paradigms and prepare AI models for new interaction modes?
2. As models become increasingly powerful, how can we supervise and evaluate models that are stronger than us?

3. How should we prepare for the societal and economic impact of stronger AIs? What are the best practices to prevent harmful outcomes?

Regarding these three challenges, my future work is as follows.

7.3.1 Self-Learning with Evolving Common Ground

We want AI collaborators that adapt to our knowledge and working styles. However, LLMs are stateless, meaning that they do not have a clear memory of prior interactions. I ask a central question: how can we make them self-learn from experience? A self-evolving model with dynamic CG modeling has significant implications for applications such as personalized AI, lifelong-learning agents that improve iteratively, and approaching superhuman intelligence. I propose a two-stage pipeline using common ground as a central module. Stage 1: Distilling insights from interactions. We can devise an external memory base that stores prior interactions with users and derives insights from past experience. For example, a writing-assistant AI can note down prior interactions with a user and derive insights such as the user's writing style and preferences in wording to better help in future sessions. Stage 2: Refining the CG Reasoning Structure. Inspired by the effectiveness of structured reasoning (Chapter 6), we can utilize the reasoning structure as a proxy for the common ground with users and incorporate the insights derived from past interactions into the structure. The two-stage pipeline can be iterated to continually adapt to different users and self-improve.

7.3.2 Interactive Super-Alignment

AI models will likely have more knowledge than humans in an increasing number of domains, as we have already seen signs in areas such as annotation, translation (for some languages), and even writing social media posts. The critical question then becomes: how do we supervise AI and make decisions on problems we do not know the answers to? This is closely related to the field of scalable oversight in AI and has significant impact on applications including scientific discovery and the future of work. I propose to consider two sub-questions: 1) How do we supervise AIs smarter than us? and 2) How do we consult AIs on problems for which we do not know the ground truth? For the first, I plan to utilize ToM, a part of CG reasoning, to perform knowledge estimation, in order to build models that can generalize from imperfect signals. Since humans do not have 100% expertise in all domains, it is crucial for AI models to recognize the drawbacks of human input and use ToM to estimate the knowledge gap and what to improve. For the second question, I propose Gamification-as-Alignment, i.e., using AI-AI interaction for better decision-making. There has already been work showing that having two models debate each other while a human reads the transcript can yield higher human task-solving rates than consulting a single instance of an LLM.

7.3.3 Safety to Deploy in the Wild from Robust Reasoning

Last but not least, LLMs are increasingly being used in safety-critical domains such as medicine and education but suffer from jailbreaking and backdoor attacks. The important question is thus: with more capable AI agents that have increasing autonomy, how do we ensure safety against malicious usage? Solving this research question has key implications for the security and privacy of AI in general and the potential harms of misalignment.
I propose to tackle this safety concern by introducing a new meta-morality reasoning module based on CG and interaction history. Consider the case of jailbreaking an LLM: a model will refuse to give a tutorial on how to make a bomb if asked directly, but will willingly give the tutorial if you append random strings after the request. I argue that such behaviors are caused by a lack of robustness in reasoning. If the model can successfully reason that the user's intent is still to get a tutorial on how to make a bomb, which has harmful impacts, it should reject the request no matter what the request is embedded in. Thus, I plan to include meta-morality reasoning in the CG reasoning module, which considers two types of ethics guidelines: 1) generally designed external guidelines such as constitutions for AI, and 2) scenario-specific ethics that have to be inferred from the current interaction and the common ground, e.g., "Is the user's intent to hurt others?". Synergies between general guidelines and scenario-specific morality CG reasoning should enhance robustness and thus lead to safer AI agents.

In conclusion, this dissertation has pushed the boundaries of common ground reasoning in AI, contributing to the development of communicative agents that can effectively collaborate and interact with humans in natural and intuitive ways. While significant challenges remain, the advancements made in this work provide a solid foundation for future research, paving the way towards more capable and reliable AI systems. I firmly believe that developing communicative agents that benefit human society is one of the most important tasks to work on, and I am beyond excited to continue working in this direction to ensure more connected human-AI communication.

Bibliography

[1] Leonard Adolphs, Kurt Shuster, Jack Urbanek, Arthur Szlam, and Jason Weston. "Reason first, then respond: Modular Generation for Knowledge-infused Dialogue". In: arXiv preprint arXiv:2111.05204 (2021). [2] Jens Allwood. Linguistic communication as action and cooperation. University of Göteborg. Department of Linguistics, 1976. [3] Jacopo Amidei, Paul Piwek, and Alistair Willis. "The use of rating and Likert scales in Natural Language Generation human evaluation tasks: A review and some recommendations". In: (2019). [4] Prithviraj Ammanabrolu, Renee Jia, and Mark O Riedl. "Situated Dialogue Learning through Procedural Environment Generation". In: Association for Computational Linguistics (ACL). 2022. url: https://arxiv.org/abs/2110.03262. [5] Prithviraj Ammanabrolu, Jack Urbanek, Margaret Li, Arthur Szlam, Tim Rocktäschel, and Jason Weston. "How to Motivate Your Dragon: Teaching Goal-Driven Agents to Speak and Act in Fantasy Worlds". In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021, pp. 807–833. [6] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. "Palm 2 technical report". In: arXiv preprint arXiv:2305.10403 (2023). [7] Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. "Video pretraining (vpt): Learning to act by watching unlabeled online videos". In: arXiv preprint arXiv:2206.11795 (2022).
[8] Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. “Human-level play in the game of Diplomacy by combining language models with strategic reasoning”. In: Science 378.6624 (2022), pp. 1067–1074. doi: 10.1126/science.ade9097. eprint: https://www.science.org/doi/pdf/10.1126/science.ade9097. 115 [9] Satanjeev Banerjee and Alon Lavie. “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments”. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 2005, pp. 65–72. [10] Cristian-Paul Bara, CH-Wang Sky, and Joyce Chai. “MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021, pp. 1112–1125. [11] Simon Baron-Cohen, Alan M Leslie, and Uta Frith. “Does the autistic child have a “theory of mind”?” In: Cognition 21.1 (1985), pp. 37–46. [12] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. “Graph of thoughts: Solving elaborate problems with large language models”. In: arXiv preprint arXiv:2308.09687 (2023). [13] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. " O’Reilly Media, Inc.", 2009. [14] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. “PIQA: Reasoning about Physical Commonsense in Natural Language”. In: AAAI (2020). [15] David Bohm, Peter M Senge, and Lee Nichol. On dialogue. Routledge, 2004. [16] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. “COMET: Commonsense Transformers for Automatic Knowledge Graph Construction”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 4762–4779. doi: 10.18653/v1/P19-1470. [17] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901. [18] Jerome S Bruner. “The act of discovery.” In: Harvard educational review (1961). [19] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. “MultiWOZ-A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018, pp. 5016–5026. [20] Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. “(Meta-) Evaluation of Machine Translation”. In: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 136–158. url: https://aclanthology.org/W07-0718. 
116 [21] Chris Callison-Burch, Gaurav Singh Tomar, Lara Martin, Daphne Ippolito, Suma Bailis, and David Reitter. “Dungeons and Dragons as a Dialog Challenge for Artificial Intelligence”. In: The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). Abu Dhabi, UAE, 2022. [22] Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. “Evaluation of text generation: A survey”. In: arXiv preprint arXiv:2006.14799 (2020). [23] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks”. In: arXiv preprint arXiv:2211.12588 (2022). [24] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. “Palm: Scaling language modeling with pathways”. In: arXiv preprint arXiv:2204.02311 (2022). [25] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. “Scaling instruction-finetuned language models”. In: arXiv preprint arXiv:2210.11416 (2022). [26] Herbert H Clark and Susan E Brennan. “Grounding in communication.” In: (1991). [27] Herbert H Clark and Edward F Schaefer. “Contributing to discourse”. In: Cognitive science 13.2 (1989), pp. 259–294. [28] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. “Training verifiers to solve math word problems”. In: arXiv preprint arXiv:2110.14168 (2021). [29] Kate Crawford. The atlas of AI: Power, politics, and the planetary costs of artificial intelligence. Yale University Press, 2021. [30] Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, and Ming Zhou. “MuTual: A Dataset for Multi-Turn Dialogue Reasoning”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, pp. 1406–1416. [31] Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. “Collaborating with language models for embodied reasoning”. In: arXiv preprint arXiv:2302.00763 (2023). [32] Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and João Sedoc. “Decoding Methods for Neural Narrative Generation”. In: Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021). 2021, pp. 166–185. [33] Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. “Survey on evaluation methods for dialogue systems”. In: Artificial Intelligence Review (2020), pp. 1–56. 117 [34] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. [35] Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. “Wizard of wikipedia: Knowledge-powered conversational agents”. In: arXiv preprint arXiv:1811.01241 (2018). [36] Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. “Compositional semantic parsing with large language models”. 
In: arXiv preprint arXiv:2209.15003 (2022). [37] Mihail Eric, Nicole Chartier, Behnam Hedayatnia, Karthik Gopalakrishnan, Pankaj Rajan, Yang Liu, and Dilek Hakkani-Tur. “Multi-Sentence Knowledge Selection in Open-Domain Dialogue”. In: Proceedings of the 14th International Conference on Natural Language Generation. 2021, pp. 76–86. [38] Jonathan St BT Evans. “In two minds: dual-process accounts of reasoning”. In: Trends in cognitive sciences 7.10 (2003), pp. 454–459. [39] Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. “Promptbreeder: Self-referential self-improvement via prompt evolution”. In: arXiv preprint arXiv:2309.16797 (2023). [40] Joseph L Fleiss. “Measuring nominal scale agreement among many raters.” In: Psychological bulletin 76.5 (1971), p. 378. [41] Michael C Frank and Noah D Goodman. “Predicting pragmatic reasoning in language games”. In: Science 336.6084 (2012), pp. 998–998. [42] Daniel Fried, Nicholas Tomlin, Jennifer Hu, Roma Patel, and Aida Nematzadeh. “Pragmatics in Grounded Language Learning: Phenomena, Tasks, and Modeling Approaches”. In: arXiv preprint arXiv:2211.08371 (2022). [43] Uta Frith and Christopher D Frith. “Development and neurophysiology of mentalizing”. In: Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 358.1431 (2003), pp. 459–473. [44] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. 2020. arXiv: 2004.07219 [cs.LG]. [45] Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah D Goodman. “Understanding social reasoning in language models with language models”. In: arXiv preprint arXiv:2306.15448 (2023). [46] Kanishk Gandhi, Dorsa Sadigh, and Noah D Goodman. “Strategic Reasoning with Language Models”. In: arXiv preprint arXiv:2305.19165 (2023). 118 [47] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. “Pal: Program-aided language models”. In: International Conference on Machine Learning. PMLR. 2023, pp. 10764–10799. [48] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. “A knowledge-grounded neural conversation model”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018. [49] Deepanway Ghosal, Siqi Shen, Navonil Majumder, Rada Mihalcea, and Soujanya Poria. “CICERO: A Dataset for Contextualized Commonsense Inference in Dialogues”. In: arXiv preprint arXiv:2203.13926 (2022). [50] Noah D Goodman and Michael C Frank. “Pragmatic language interpretation as probabilistic inference”. In: Trends in cognitive sciences 20.11 (2016), pp. 818–829. [51] Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. “Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations.” In: INTERSPEECH. 2019, pp. 1891–1895. [52] Herbert P Grice. “Logic and conversation”. In: Speech acts. Brill, 1975, pp. 41–58. [53] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. “A real-world webagent with planning, long context understanding, and program synthesis”. In: arXiv preprint arXiv:2307.12856 (2023). [54] Jürgen Habermas. The theory of communicative action: Volume 1: Reason and the rationalization of society. Vol. 1. Beacon press, 1985. 
Conceptually similar
Bridging the visual reasoning gaps in multi-modal models
Grounding language in images and videos
Externalized reasoning in language models for scalable and trustworthy AI
Computational foundations for mixed-motive human-machine dialogue
Event-centric reasoning with neuro-symbolic networks and knowledge incorporation
Identifying and mitigating safety risks in language models
Integrating annotator biases into modeling subjective language classification tasks
A framework for research in human-agent negotiation
Multimodal reasoning of visual information and natural language
Towards social virtual listeners: computational models of human nonverbal behaviors
Generating and utilizing machine explanations for trustworthy NLP
Annotating FrameNet via structure-conditioned language generation
Learning controllable data generation for scalable model training
Aggregating symbols for language models
Robust and generalizable knowledge acquisition from text
Emphasizing the importance of data and evaluation in the era of large language models
Effective graph representation and vertex classification with machine learning techniques
Modeling social causality and social judgment in multi-agent interactions
Interpretable machine learning models via feature interaction discovery
Countering problematic content in digital space: bias reduction and dynamic content adaptation
Asset Metadata
Creator: Zhou, Pei (author)
Core Title: Common ground reasoning for communicative agents
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2024-08
Publication Date: 07/01/2024
Defense Date: 04/24/2024
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tags: AI agents, artificial intelligence, Communication, language models, machine learning, natural language processing, reasoning
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisors: Pujara, Jay (committee chair), Ren, Xiang (committee chair), Mintz, Toby (committee member), Zhao, Jieyu (committee member)
Creator Email: peiz@usc.edu, peizhou57@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113997AJA
Unique Identifier: UC113997AJA
Identifier: etd-ZhouPei-13175.pdf (filename)
Legacy Identifier: etd-ZhouPei-13175
Document Type: Dissertation
Rights: Zhou, Pei
Internet Media Type: application/pdf
Type: texts
Source: 20240701-usctheses-batch-1176 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu