Automatic Evaluation of Open-Domain Dialogue Systems

by

Sarik Ghazarian

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2023

Copyright 2023 Sarik Ghazarian

I dedicate my dissertation to
MY BEAUTIFUL DAUGHTER ELENI, who made me the happiest person in the world with her precious existence
MY HUSBAND, who always overwhelms my world with his endless love and positive energy
MY FAMILY, who I am grateful to be surrounded by their constant support and love
MY LOVELY SISTER IN HEAVEN, who I am sure watches me from above and is truly happy about my achievements

Acknowledgments

I would like to express my deepest appreciation to my advisor, Professor Aram Galstyan, whose constructive feedback and intelligent advice have led me throughout my PhD journey and made it an unforgettable chapter in my life. His faith in me encouraged me to follow my interests and conduct research on natural language processing as my favorite field of study. Secondly, I would love to extend my sincere gratitude to my co-advisor, Professor Nanyun Peng, as this endeavor would not have been possible without her thorough support and guidance. She is my role model, whose indescribable passion for research taught me to persevere and strive for success.

I would like to extend my intimate thanks to my sympathetic and supportive friend, Ninareh Mehrabi, who has influenced me to do my utmost. I also want to thank all my collaborators who helped me to do impactful work: Ralph Weischedel, Yijia Shao, Rujun Han, Xuezhe Ma, Sahil Garg, Nuan Wen, Zixi Liu, Tuhin Chakrabarty, and Johnny Tian-Zheng Wei. Moreover, I am grateful to have many true friends whom I always have had the pleasure of talking to and sharing my ideas with: Sidi Liu, Shushan Arakelian, Mozhdeh Gheini, Mingyu Derek Ma, Te-Lin Wu, I-Hung Hsu. I state my special thanks to the committee members, Professors Aram Galstyan, Nanyun Peng, Kallirroi Georgila, Morteza Dehghani, and Gaurav Sukhatme. In addition, I also thank my mentors and managers at Amazon, especially Behnam Hedayatnia and Yang Liu, who made my internships full of remarkable research experiences.

I want to profoundly thank my caring and loving husband, Sevak Arakelian, who has been beside me throughout my years of study and has dedicated himself to my progress. Last but not least, I truly appreciate my family for always surrounding me with their persistent love and encouraging me to keep pursuing my goals.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Relevance Evaluation of Open-Domain Dialogue Systems
  2.1 Human Evaluations
  2.2 Overlap-based Automatic Evaluation Metrics
  2.3 Trainable Automatic Evaluation Metrics
  2.4 Referenced and Unreferenced Relevance Evaluators
  2.5 Proposed Models
    2.5.1 Unreferenced Metric
      2.5.1.1 Word Embeddings
      2.5.1.2 Sentence Representation
      2.5.1.3 MLP Network
    2.5.2 Referenced Metric
  2.6 Dataset
    2.6.1 Generated Responses
    2.6.2 Human Judgments
  2.7 Experimental Setup
    2.7.1 Static Embeddings
    2.7.2 Contextualized Embeddings
    2.7.3 Training the Unreferenced Model
  2.8 Results
    2.8.1 Unreferenced Metrics Results
    2.8.2 Unreferenced + Referenced Metrics Results
  2.9 Conclusion
Chapter 3: Engagement Evaluation of Open-Domain Dialogue Systems
  3.1 Engagement Metrics
  3.2 Analysis of Engagement Scores
    3.2.1 Conversation-level Engagement Scores
    3.2.2 Utterance-level Engagement Scores
    3.2.3 Utterance-level and Conversation-level Engagement Scores
  3.3 Engagement Classifier
  3.4 Experimental Setup and Results
    3.4.1 Baseline Models
    3.4.2 Datasets
      3.4.2.1 ConvAI
      3.4.2.2 Daily Dialogue Dataset
    3.4.3 Implementation Details
      3.4.3.1 Transfer Learning
    3.4.4 Experimental Results
      3.4.4.1 Quantitative Results
      3.4.4.2 Significance Test
      3.4.4.3 Case Study
  3.5 Conclusion
Chapter 4: Coherence Evaluation of Open-Domain Dialogue Systems
  4.1 Different Perturbations for Coherence Evaluation
  4.2 DEAM Overview
  4.3 Incoherent Dialogue Generation
    4.3.1 Baselines Manipulations
    4.3.2 AMR-based Manipulations
      4.3.2.1 Contradiction
      4.3.2.2 Coreference Inconsistency
      4.3.2.3 Irrelevancy
      4.3.2.4 Decrease Engagement
  4.4 Experimental Setup
    4.4.1 Datasets
      4.4.1.1 Training Datasets
      4.4.1.2 Evaluation Datasets
    4.4.2 Implementation Details
  4.5 Results
    4.5.1 Metrics Performance
    4.5.2 Manipulations Effect
    4.5.3 Ablation Studies
    4.5.4 Qualitative Analysis
    4.5.5 Manipulation Coverage
  4.6 Conclusion
Chapter 5: Commonsense Evaluation of Open-Domain Dialogue Systems
  5.1 Event Commonsense
  5.2 Method
    5.2.1 Symbolic Intermediate Representation
    5.2.2 Event-Relation Extraction
    5.2.3 Compatibility Test
    5.2.4 Dialogue Data Preparation
    5.2.5 Tuple Extraction
    5.2.6 Commonsense Scoring
    5.2.7 Additional Human-Machine Dialogues
    5.2.8 Setups
    5.2.9 Baselines
    5.2.10 ACCENT Implementation
    5.2.11 Metrics Performance
    5.2.12 Tuple Extraction Performance
    5.2.13 Compatibility Test Performance
    5.2.14 Ablation Studies
    5.2.15 Error Analysis
  5.3 Conclusion
Chapter 6: Controllable Dialogue Generation through Automatic Evaluation Metrics
  6.1 Controllable Dialogue Generation
  6.2 Constraints in Dialogue LLMs
    6.2.1 Model
    6.2.2 Data
    6.2.3 Automatic Evaluation Feedback
  6.3 Results and Analysis
  6.4 Conclusion
Chapter 7: Related Work
  7.1 Relevance
  7.2 Engagement
  7.3 Coherence
  7.4 Commonsense
    7.4.1 Joint Event-Relation Extraction
    7.4.2 CSKB Compatibility
  7.5 Controllable Generation
Chapter 8: Discussion
  8.1 Conclusion
  8.2 Future Directions
Bibliography

List of Tables

2.1 Example of the inappropriateness of overlap-based metrics. The BLEU metric assigns a zero score to the acceptable generated response because it does not find any overlaps between the generated and reference responses.
2.2 Examples of query-response pairs, each rated by three AMT workers with scores from 1 (not appropriate response) to 5 (completely appropriate response).
2.3 Correlations and similarity values between relatedness scores predicted by different unreferenced models and human judgments. The first row is RUBER's unreferenced model.
2.4 Correlation and similarity values between automatic evaluation metrics (combinations of the referenced and unreferenced metrics) and human annotations for 300 query-response pairs annotated by AMT workers. The "Pooling" column shows the combination type of the referenced and unreferenced metrics.
3.1 Data statistics of the ConvAI evaluation dataset. The first row shows conversations with their corresponding engagement scores extracted from the original ConvAI dataset; the second row contains the number of utterances and their engagement scores automatically assigned by our heuristics.
3.2 The results of the Amazon Mechanical Turk (AMT) experiments on utterance-level engagement. 49 annotators annotated 297 utterances and demonstrated quite high inter-annotator kappa agreement and Pearson correlation between annotations.
3.3 The Pearson correlation between engagement scores of 50 randomly selected conversations from ConvAI and the aggregated engagement scores of their utterances annotated by AMT workers with different aggregation methods.
3.4 ConvAI train/valid/test sets of utterances with their engagement score labels.
3.5 Pearson and Spearman correlations between human judgements and several automatic dialogue evaluation metrics on generated responses, human-written responses, and their mixture. We adopt the mean aggregation of the relevance score of contextualized RUBER (Ctx_RUBER) and the predictive utterance-level engagement scores. The first two rows in each group show correlations between human judgements and baseline models with only relevance scores, the middle two rows are for only engagement scores, and the last two rows add engagement scores into relevance scores. Boldface indicates the improvements are significant compared to the baseline in the corresponding group (p < .05).
3.6 Examples of evaluation scores for utterances from the Daily Dialogue Dataset. We used the unreferenced score of contextualized RUBER as the relevance score and MLP BERT (mean) as the engagement score; the numbers are rounded to 2 digits. The incorporation of engagement scores into relevance scores yields scores closer to human judgements, the main goal of automatic evaluation metrics.
4.1 Statistics (size, average length of conversations and utterances) of the TopicalChat and PersonaChat train/valid datasets (augmented with AMR-based manipulated conversations), alongside the FED and DSTC9 test datasets.
4.2 Spearman correlations of different models with human judgements, trained on TopicalChat and PersonaChat datasets following different manipulations for negative sample generation (setup 1).
4.3 Spearman correlations of the same RoBERTa-large models finetuned on TopicalChat and PersonaChat datasets augmented with incoherent conversations generated by different manipulation techniques (setup 3).
4.4 Pairwise comparisons between DEAM with the proposed AMR-based manipulations and different baseline models using their original datasets, manipulations, and models (setup 2). All models have been trained on balanced sets of coherent/incoherent examples.
4.5 Correlations of DEAM with human judgments, trained on different ablated manipulations.
4.6 Examples of incoherent conversations resulting from different applied manipulation techniques.
4.7 Examples of original sentences, their linearized AMR graphs, and back-translated sentences, indicated with green, black, and blue colors respectively. Bold parts of the sentences demonstrate the syntactical changes resulting from AMRs, which usually do not cover such information.
5.1 Semantic meanings and designed prompts for the selected ATOMIC-2020 relations. The semantic meanings are from Hwang et al. (2021).
5.2 Pearson (γ) and Spearman (ρ) correlations between human judgments and different automatic evaluation metrics. The results for ACCENT are all significant (p < 0.05).
5.3 Examples of evaluation scores given by ACCENT from the DECO test set. Only u_{n-1} in the dialogue history is shown, and the human annotations are normalized to the range of (0, 1) for better demonstration. The "Extracted Tuple" column shows tuples extracted in the ACCENT framework which account for the given scores.
5.4 Performance of different event-relation extraction methods on the DECO test set. P: Precision. R: Recall.
5.5 Test results on the CSKB compatibility benchmark. We report the overall AUC across all relations ("All") and the AUC across samples with our target relations ("Subset"). Both the averaged metric and its standard deviation are reported over 3 runs.
5.6 Ablation results measured by Spearman correlation. I: ablation of the event-relation extraction part. The gray row shows the results using human-extracted tuples, which provides an upper bound. II: ablation of the compatibility test part of ACCENT.
5.7 Results with different sentence embedding methods measured by Spearman correlation. Following Gao et al. (2021), we use the average results on the semantic textual similarity (STS) tasks to reflect sentence embedding performance.
5.8 Correlations between human judgments and different compatibility test approaches with human-extracted tuples on the DECO test set.
5.9 Examples of results given by different compatibility test approaches. Only u_{n-1} in the dialogue history is shown. Human annotations are normalized to the range of (0, 1), and scores given by "COMET (neural)" are scaled up by 100 times for better demonstration. Compatibility scores for each tuple are averaged to get the final score for the target response (in bold font).
5.10 Some examples of the automatically extracted event-relation tuples. These tuples function as the intermediate symbolic representation in the ACCENT framework.
6.1 Examples of responses generated by the Vicuna, Alpaca, and FastChat-T5 models alongside their types of issues indicated by the majority of annotators.
6.2 Pearson correlations of pretrained (Pre) and finetuned (Ft) automatic evaluation metrics on the test set of DETA.
6.3 Pairs of responses generated by Alpaca and the CDAIF(rel), CDAIF(eng), or CDAIF(rel_eng) models. The winner responses are the ones preferred by the majority of annotators.

List of Figures

2.1 An illustration of changes applied to RUBER's unreferenced metric's architecture. Red dotted double arrows show three main changes. The leftmost section is related to substituting word2vec embeddings with BERT embeddings. The middle section replaces Bi-RNNs with simple pooling strategies to get sentence representations. The rightmost section switches the ranking loss function to an MLP classifier with a cross-entropy loss function.
2.2 BERT-based referenced metric. Static word2vec embeddings are replaced with BERT embeddings (red dotted section).
3.1 An illustrative example of a user's conversation with two chatbots. We anticipate that the user will prefer to converse with the top chatbot because the responses are both relevant and engaging, while the bottom chatbot generates relevant but not engaging responses.
3.2 The left scatterplot depicts the correlation between the ground-truth conversation-level engagement scores and the mean aggregation of engagement scores of utterances for 50 conversations used in the AMT experiment. The Pearson correlation value is 0.85. The right scatterplot depicts the correlation between the engagement scores of 297 utterances annotated by humans in the AMT experiment and the conversation-level engagement score heuristically assigned to all utterances in the conversation. The Pearson correlation value is 0.60.
3.3 An illustration of the proposed utterance-level engagement classifier.
3.4 Balanced accuracy (ROC AUC) of different utterance-level engagement classifiers on the ConvAI test set (the last row in Table 3.4) and the human-annotated test set (Table 3.2). The first two groups of bars show SVM and MLP classifier performance based on word2vec embeddings; the remaining bars are our proposed classifiers based on BERT embeddings with mean and max pooling strategies.
3.5 Pearson and Spearman correlations between predictive engagement scores and human judgements for generated and human-written responses.
4.1 Examples of human-human (first dialogue) and human-chatbot (second dialogue) conversations alongside manipulations resulting from baseline (yellow) and our proposed (purple) perturbations, respectively. Similar to the human-chatbot interaction, our manipulations result in more subtly incoherent dialogue compared to baseline manipulations.
4.2 Overall architecture of the DEAM metric trained on positive (green box) interactions and negative (red box) conversations generated from AMR-based manipulations (orange box).
4.3 AMR representation of a dialogue utterance.
4.4 An abbreviated sample conversation to illustrate four different AMR-based DEAM manipulations.
4.5 Scatter plots and regression lines of different models' predicted scores versus FED-coherence human evaluations. Overlapping points are represented darker.
4.6 Scatter plots and regression lines of different models' predicted scores versus FED-overall human evaluations. Overlapping points are represented darker.
4.7 The accuracy of evaluation metrics in distinguishing coherent/incoherent conversations in test data (y-axis) generated using baseline manipulations (x-axis).
4.8 Statistics of different types of logical flaws observed in 50 randomly selected low-quality interactions between humans and state-of-the-art dialogue systems (25 dialogues from FED and 25 from DSTC9).
5.1 Examples of nonsensical system responses in open-domain dialogues.
5.2 The overview of ACCENT. Given the target response and its dialogue history, ACCENT first extracts the event-relation tuples. Then, the compatibility test (detailed illustration in Figure 5.4) assigns a score to each tuple: ACCENT queries the dynamic CSKB, i.e., COMET, with h and r, and generates k events. The compatibility score is the maximum similarity between the ground-truth t and the k generated events {t_gen^i}_{i=1}^k. Scores for all tuples in a response are averaged to obtain the event commonsense score for the target response.
5.3 Illustration of event-relation extraction. For each relation r in the selected relation set R̃, we use its corresponding prompt to guide the model to generate h and t. The final tuple is parsed from the generated output.
5.4 An example of the compatibility test. We query the dynamic CSKB with h and r, and the compatibility score is the maximum similarity between t and the generated tail events ({t_gen^i}_{i=1}^k).
5.5 Relation distribution in the DECO test set.
5.6 Average event commonsense scores of generated responses of different models using human annotations (scale 1 to 5) and ACCENT automatic evaluation (scale 0 to 1). The rankings of systems given by humans and ACCENT are the same.
6.1 Generated responses by an LLM versus our proposed controllable generative model (CDAIF), which results in a more relevant and engaging response by altering the LLM's original probability distribution for the highlighted token, taking into account signals from automatic evaluation metrics as the control feedback.
6.2 The percentage of responses with/without issues generated by the Vicuna (left), Alpaca (middle), and FastChat-T5 (right) models.
6.3 Percentage of different types of issues existing in the responses generated by the Vicuna, Alpaca, and FastChat-T5 models.
6.4 The percentage of generated responses by Alpaca and CDAIF models, controlled by engagement, relevance, and both constraints, that have higher and/or equal quality.

Abstract

With the rapid development of open-domain dialogue systems in recent years, it is imperative to have precise evaluation metrics that correctly assess the quality of these systems. To this end, many researchers resort primarily to human evaluation, which is time-consuming, expensive, and does not facilitate model comparisons across research papers. Therefore, the existence of accurate automatic evaluation metrics that can accelerate the development cycle by assisting the process of architecture search and hyperparameter tuning is necessary. Reference-based metrics such as BLEU or ROUGE fail to correlate well with human judgment in open-domain settings, as there can be potentially many plausible generations that do not overlap significantly with the limited set of given references. This failure leads the research towards learning-based evaluation metrics that are more sophisticated and reliable.

Automatic evaluation of open-domain dialogue systems has a multifaceted nature with many fine-grained quality aspects. This dissertation explores both turn-level and conversation-level facets of open-domain dialogue evaluation. We train models that automatically assess the relevance, engagement, coherence, and commonsense aspects of the responses generated by dialogue models. We formulate the evaluation as a classification task to identify the quality of the responses.
To this end, we focus on the training data and the model architecture of these metrics as the two main components on which the metrics' quality strongly relies. We start with heuristic text-level manipulations, such as random swapping of utterances, to create negative samples for training evaluation metrics. Then, we show that such manipulations are insufficient to appropriately reflect the issues that occur in interactions between advanced dialogue models and humans. To tackle this issue, we move toward advanced semantic-level perturbations of human-written responses to generate challenging negative responses that are more likely to be produced by state-of-the-art dialogue models. Next, we complete our investigation of dialogue evaluation by concentrating on the model architecture of these metrics. To this end, we incorporate knowledge from knowledge bases and leverage prompt-based generative models in a low-resource setting. Finally, in addition to dialogue assessment, the main goal of automatic evaluation metrics, we leverage them as influential control factors to guide dialogue models and generate higher-quality responses.

Chapter 1: Introduction

Recent advances in open-domain dialogue systems have heightened the importance of automatic evaluation metrics. Human evaluation is no longer a practical choice of assessment, as the large number of human annotations needed for hyper-parameter tuning or architecture searches of dialogue models makes it infeasible (Britz et al., 2017; Melis et al., 2018). Therefore, automatic evaluation metrics are essential to provide human-like, reliable assessment while also saving time and effort. Traditional reference-based metrics such as BLEU (Papineni et al., 2002a) and ROUGE (Lin, 2004) are also ineffective in the open-domain setting because there could be many valid responses for a given context that share no common words with the reference responses (Liu et al., 2016). To overcome this limitation, recent research has explored the feasibility of trainable evaluation metrics capable of assessing the quality of generations. These metrics are essentially classifiers trained to distinguish between high- and low-quality dialogue responses.

Due to the sophisticated nature of conversations, the quality factors in dialogue evaluation are classified in a hierarchy of turn-level and conversation-level aspects. In this dissertation, we consider both levels of evaluation and show their challenges in achieving reliable evaluation metrics. Relevance between utterances is one of the basic features that any conversation between two interlocutors should contain for meaningful and continuous interaction. We define the relevance metric as a turn-level evaluator that measures the relatedness of a response to the previous query. Relevance is also required at the conversation level, where the overall set of query-response pairs must form coherent exchanges; this aspect is called coherence. Although relevance is a desired attribute, we show it is insufficient, as it cannot adequately capture all the characteristics of open-domain dialogue systems. Generated responses should be both relevant and interesting to motivate the user to continue the interaction. Finally, this dissertation also discusses the commonsense reasoning aspect of dialogue responses. Commonsense reasoning is omnipresent in human communication and, consequently, an important feature in evaluating open-domain dialogue systems.
Despite its significance, no prior research has focused on commonsense evaluation in dialogue systems. In this dissertation, we propose an event commonsense evaluation metric empowered by commonsense knowledge bases: it extracts event-relation tuples from a dialogue and evaluates them in terms of their compatibility with the commonsense knowledge bases.

Automatic evaluation is usually defined as a classification task that learns to classify responses based on their examined quality aspects. Similar to other classification tasks, the two main components that directly affect its performance are the data and the model architecture.

• Data: Automatic evaluation metrics must be trained on data containing high- and low-quality responses. This training data should be close to the responses generated by the various state-of-the-art dialogue models under evaluation. Therefore, preparing a high-quality training dataset is a critical task. Providing positive responses that meet all the quality constraints is straightforward, as they come directly from human-human conversations. The main challenge concerns negative samples, which are bootstrapped from manipulations applied to human-written responses. The goal of these manipulations is to mimic possible system drawbacks that generative models could produce. Therefore, the first main question that this dissertation focuses on is how to create negative examples that can adequately reflect different types of errors made by dialogue systems and thus result in more natural and challenging low-quality samples.

• Model: The architecture of automatic evaluation metrics is defined based on the evaluation aspect, i.e., whether the assessment is done at the turn or conversation level. We investigate both contrastive supervised learning-based models that are designed to differentiate positive and negative inputs, and few-shot prompt-based learning models. Accordingly, the second question that we address in this dissertation is what is the best model architecture for the automatic evaluation metric, considering the type of assessment aspect?

Besides quality assessment, the major application of automatic dialogue evaluation metrics, there can be another, reverse purpose: providing signals and feedback to control generation. Even recently proposed large language models (Zhang et al., 2020c; Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023; Chung et al., 2022), which have engendered a stunning revolution in natural language generation including dialogue systems, still struggle to conduct conversations that align well with human preferences. In this dissertation, we study the usage of automatic dialogue evaluation metrics as reliable and cost-efficient resources for the imperative task of controlling LLM generations from different perspectives, such as relevance and engagingness.

To summarize, in this dissertation we address the problem of automatic dialogue evaluation and propose methods for training better evaluation metrics that can be reliable replacements for human judgments, facilitating comparison and assessment of dialogue models. We present our findings in the following order. In Chapter 2, we introduce a trainable evaluation metric to measure relevance as an essential turn-level feature for dialogue systems. We train this metric in a contrastive learning setup, following simple random matching perturbations to create negative samples, and investigate the effectiveness of transformer-based models in representing model inputs.
In Chapter 3, we show the inadequacy of the relevance metric in measuring the overall quality of dialogue generations. We propose a turn-level evaluator for engagement and incorporate the predicted engagement scores into relevance estimations to get higher-quality evaluations. In Chapter 4, we discuss more complex semantic-level approaches to creating negative samples by perturbing human-written examples at an abstract semantic level, in order to train evaluation metrics that measure the coherence of the entire conversation. In Chapter 5, we discuss commonsense as another important yet neglected feature for dialogue evaluation. Our proposed metric extracts event-relation tuples from utterances and computes their compatibility score using commonsense knowledge bases. Chapter 6 addresses the controllable dialogue generation task for controlling the generation of large language models; we leverage our proposed automatic evaluation metrics to guide dialogue models to generate more relevant and engaging responses. Chapter 7 reviews previous work on evaluation metrics for different quality features and on approaches designed to reach more accurate evaluations. We conclude by discussing potential ideas for future work in Chapter 8.

Chapter 2: Relevance Evaluation of Open-Domain Dialogue Systems

The fast development of open-domain dialogue systems (i.e., chatbots) makes their automatic evaluation more challenging and crucial. Automatic evaluation metrics are mainly designed to provide efficient assessments of dialogue systems that make the comparison of dialogue models across research papers easier and speed up system development. Due to the multi-faceted nature of dialogue evaluation, the quality of generations is assessed from various perspectives. In this chapter, we cover relevance as one of the essential features that keeps the interaction between two interlocutors going. We propose a trainable model that learns to measure the relevance between a response and its given query. The training data includes relevant pairs coming from human-written conversations and heuristically constructed irrelevant examples. We incorporate contextualized embeddings of the inputs using transformer-based models and explore their positive impact on the evaluation task. We also examine different network architectures and objective functions to better utilize contextualized word embeddings for the evaluation task.

2.1 Human Evaluations

Evaluation is one of the main components in the development of machine learning models, including dialogue systems. Many researchers primarily resort to human evaluation for assessing their dialogue systems' performance (Shang et al., 2015; Sordoni et al., 2015; Shao et al., 2017). They collect human annotations by asking annotators to rate the quality of generations, taking into account different aspects. Human annotation has two main problems: 1) it is time-consuming and expensive, and 2) it does not facilitate comparisons across research papers. For certain research questions that involve hyper-parameter tuning or architecture searches, the amount of human annotation required makes such studies infeasible (Britz et al., 2017; Melis et al., 2018). Therefore, developing reliable automatic evaluation metrics for open-domain dialogue systems that can replace human annotation is imperative, as it can save time, effort, and cost.

Dialogue Context:
  Speaker 1: Hey! What are you doing here?
  Speaker 2: I'm just shopping.
Query: What are you shopping for?
Generated Response: Some new clothes.
Reference Response: I want buy gift for my mom!
Table 2.1: Example of the inappropriateness of overlap-based metrics. The BLEU metric assigns a zero score to the acceptable generated response because it does not find any overlaps between the generated and reference responses.

2.2 Overlap-based Automatic Evaluation Metrics

Metrics based on n-gram overlaps such as BLEU (Papineni et al., 2002a) and ROUGE (Lin, 2004), initially designed for evaluating machine translation and summarization, have been adopted to evaluate dialogue systems (Sordoni et al., 2015; Li et al., 2016; Su et al., 2018). However, Liu et al. (2016) found a weak segment-level correlation between these metrics and human judgments of response quality. As shown in Table 2.1, high-quality responses can have low or even no n-gram overlap with a reference response, indicating that these metrics are not suitable for dialogue evaluation (Novikova et al., 2017; Lowe et al., 2017).

2.3 Trainable Automatic Evaluation Metrics

The main drawback of the aforementioned n-gram overlap metrics, which makes them inapplicable to dialogue system evaluation, is that they do not consider the semantic similarity between sentences (Liu et al., 2016; Novikova et al., 2017; Lowe et al., 2017). These word-overlap metrics are not compatible with the nature of language generation, which allows a concept to appear in different sentences that share no common n-grams yet convey the same meaning. Trainable evaluation metrics are better choices that do not have the drawbacks of heuristic overlap-based metrics (Lowe et al., 2017; Tao et al., 2018a). These metrics are neural models trained to distinguish between responses and assign them quality scores. The evaluation of dialogue systems has multiple aspects; in this chapter, we focus on relevance by proposing a trainable metric that is capable of measuring how relevant each response is to its given query.

2.4 Referenced and Unreferenced Relevance Evaluators

Trainable automatic evaluation metrics can estimate the relevance of responses to the dialogue history with and without considering the reference (ground-truth) responses. The Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER) (Tao et al., 2018a) stands out from recent work in automatic dialogue evaluation; it combines the scores of a referenced and an unreferenced metric while relying minimally on human-annotated datasets of response quality for training. RUBER evaluates responses with a blend of scores from two metrics:

Unreferenced metric: computes the relevancy of a response to a given query, inspired by Grice (1975)'s theory that the quality of a response is determined by its relatedness and appropriateness, among other properties. This model is trained with negative sampling.

Referenced metric: determines the similarity between generated and reference responses using word embeddings.

Both metrics strongly depend on learned word embeddings. We propose to explore the use of contextualized embeddings, specifically BERT embeddings (Devlin et al., 2018), in composing evaluation metrics, as these embeddings have shown their positive impact in many natural language processing tasks (Devlin et al., 2018; Lewis et al., 2020).

2.5 Proposed Models

We conduct the research under the RUBER metric's referenced and unreferenced framework, where we replace its static word embeddings with pretrained BERT contextualized embeddings and compare the performances.
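Before detailing each point of variation, the overall shape of the two components can be sketched as follows. This is a minimal illustrative sketch in Python using the HuggingFace transformers and PyTorch libraries (the implementation reported later in this chapter uses TensorFlow); the function names, layer sizes, and pooling choice here are placeholders rather than the exact reported configuration.

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def embed(sentence, pooling="max"):
    # Encode a sentence with BERT and pool its token embeddings into one vector.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state          # shape (1, seq_len, 768)
    return hidden.max(dim=1).values if pooling == "max" else hidden.mean(dim=1)

class UnreferencedScorer(nn.Module):
    # MLP classifier over concatenated query/response vectors; the softmax output
    # is read as the probability that the response is relevant to the query.
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.Tanh(), nn.Linear(256, 2))

    def forward(self, query_vec, response_vec):
        logits = self.mlp(torch.cat([query_vec, response_vec], dim=-1))
        return torch.softmax(logits, dim=-1)[:, 1]          # relevance score in [0, 1]

def referenced_score(generated, reference):
    # Cosine similarity between pooled BERT embeddings of the two responses.
    return torch.cosine_similarity(embed(generated), embed(reference)).item()

# Example usage (the scorer is untrained here, so its output is not meaningful yet):
scorer = UnreferencedScorer()
print(scorer(embed("What are you shopping for?"), embed("Some new clothes.")))
print(referenced_score("Some new clothes.", "I want buy gift for my mom!"))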
We identify three points of variation, with two options each, in the unreferenced component of RUBER. The main changes are in the word embeddings, the sentence representation, and the training objectives (Ghazarian et al., 2019a); they are explained in detail in the following sections. Our experiment follows a 2x2x2 factorial design.

2.5.1 Unreferenced Metric

The unreferenced metric predicts how much a generated response is related to a given query. Figure 2.1 presents RUBER's unreferenced metric overlaid with our proposed changes to three parts of the architecture. The changes are illustrated by red dotted double arrows and concern the word embeddings, the sentence representation, and the loss function.

2.5.1.1 Word Embeddings

Static and contextualized embeddings are the two types of word embeddings that we explore.

• Word2vec. Recent works on learnable evaluation metrics use simple word embeddings such as word2vec and GloVe as input to their models (Tao et al., 2018a; Lowe et al., 2017; Kannan and Vinyals, 2017). Since these static embeddings have a fixed, context-independent representation for each word, they cannot represent the rich semantics of words in context.

• BERT. Contextualized word embeddings have recently been shown to be beneficial in many NLP tasks (Devlin et al., 2018; Radford et al., 2018; Peters et al., 2018; Liu et al., 2019). A notable contextualized embedding model, BERT (Devlin et al., 2018), performs competitively among other contextualized embeddings, so we explore the effect of BERT embeddings on the open-domain dialogue system evaluation task. Specifically, we substitute the word2vec embeddings with BERT embeddings in RUBER's unreferenced score, as shown in the leftmost section of Figure 2.1.

Figure 2.1: An illustration of changes applied to RUBER's unreferenced metric's architecture. Red dotted double arrows show three main changes. The leftmost section is related to substituting word2vec embeddings with BERT embeddings. The middle section replaces Bi-RNNs with simple pooling strategies to get sentence representations. The rightmost section switches the ranking loss function to an MLP classifier with a cross-entropy loss function.

2.5.1.2 Sentence Representation

This component composes a single vector representation for both a query and a response.

• Bi-RNN. In the RUBER model, bidirectional recurrent neural networks (Bi-RNNs) are trained for this purpose.

• Pooling. We explore the effect of replacing the Bi-RNNs with simple pooling strategies on top of the BERT word embeddings (middle dotted section in Figure 2.1). The intuition is that BERT embeddings are pretrained with deep bidirectional transformers and already encode complete information about a word's context; therefore, another layer of Bi-RNNs could simply blow up the number of parameters with no real gains.

2.5.1.3 MLP Network

A multilayer perceptron (MLP) network is the last section of RUBER's unreferenced model. It is trained by applying a negative sampling technique that adds random responses for each query to the training dataset.

Figure 2.2: BERT-based referenced metric. Static word2vec embeddings are replaced with BERT embeddings (red dotted section).

• Ranking loss.
The objective is to maximize the difference between the relatedness scores predicted for positive pairs and for the randomly added pairs. We refer to this objective function as a ranking loss. The sigmoid function in the last layer of the MLP assigns a score to each pair of query and response, indicating how related the response is to the given query.

• Cross-entropy loss. We explore the efficiency of using a simpler loss function such as cross-entropy. In fact, we treat unreferenced score prediction as a binary classification problem and replace the baseline trained MLP with an MLP classifier (right dotted section in Figure 2.1). Since we do not have a human-labeled dataset, we use a negative sampling strategy to add randomly selected responses to queries in the training dataset. We assign label 1 to original pairs of queries and responses and label 0 to the negative samples. The output of the softmax function in the last layer of the MLP classifier indicates the relatedness score for each pair of query and response.

2.5.2 Referenced Metric

The referenced metric computes the similarity between generated and reference responses. RUBER achieves this by applying pooling strategies to static word embeddings to get sentence embeddings for both the generated and reference responses. In our metric, we replace the word2vec embeddings with BERT embeddings (red dotted section in Figure 2.2) to explore the effect of contextualized embeddings on calculating the referenced score. We refer to this metric as the BERT-based referenced metric.

Query | Response | Rating
Can I try this one on? | Yes, of course. | 5, 5, 5
This is the Bell Captain's Desk. May I help you? | No, it was nothing to leave. | 1, 2, 1
Do you have some experiences to share with me? I want to have a try. | Actually, it good to say. Thanks a lot. | 3, 2, 2

Table 2.2: Examples of query-response pairs, each rated by three AMT workers with scores from 1 (not an appropriate response) to 5 (a completely appropriate response).

2.6 Dataset

We use the DailyDialog dataset (http://yanran.li/dailydialog), which contains high-quality multi-turn conversations about daily life covering various topics (Li et al., 2017a), to train our dialogue system as well as the evaluation metrics. This dataset includes almost 13k multi-turn dialogues between two parties, split into 42,000/3,700/3,900 query-response pairs for the train/test/validation sets. We divide these sets into two parts: the first part for training the dialogue system and the second part for training the unreferenced metric. To train the unreferenced metric, we randomly match queries and responses as a simple heuristic approach to constructing irrelevant pairs.

2.6.1 Generated Responses

We use the first part of the train/test/validation sets, with overall 20,000/1,900/1,800 query-response pairs, to train an attention-based sequence-to-sequence (seq2seq) model (Bahdanau et al., 2014) and generate responses for evaluation. We use the OpenNMT toolkit (Klein et al., 2017) to train the model. The encoder and decoder are Bi-LSTMs with 2 layers, each containing 500-dimensional hidden units. We use 300-dimensional pretrained word2vec embeddings as our word embeddings. The model is trained using the SGD optimizer with a learning rate of 1. We use random sampling with temperature control and set the temperature value to 0.01 empirically to obtain grammatical and diverse responses.
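The random-matching scheme used to construct training pairs for the unreferenced metric (Section 2.6) can be sketched as follows. This is an illustrative sketch rather than the exact preprocessing script, and the example pairs are placeholders.

# Illustrative sketch: build positive (label 1) and negative (label 0) training
# pairs for the unreferenced metric by randomly matching queries and responses.
import random

def build_training_pairs(query_response_pairs, seed=0):
    rng = random.Random(seed)
    responses = [r for _, r in query_response_pairs]
    examples = []
    for query, response in query_response_pairs:
        examples.append((query, response, 1))              # original pair -> relevant
        negative = rng.choice(responses)
        while negative == response and len(set(responses)) > 1:
            negative = rng.choice(responses)                # avoid sampling the true response
        examples.append((query, negative, 0))               # randomly matched -> irrelevant
    rng.shuffle(examples)
    return examples

pairs = [("What are you shopping for?", "Some new clothes."),
         ("Can I try this one on?", "Yes, of course.")]
print(build_training_pairs(pairs))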
2.6.2 Human Judgments

We collect human annotations on the generated responses in order to compute the accuracy of our proposed evaluators by calculating the correlation between human judgments and the metrics. Human annotations are collected from Amazon Mechanical Turk (AMT). AMT workers are provided a set of query-response pairs and asked to rate each pair based on the appropriateness of the response for the given query on a scale of 1-5 (not appropriate to very appropriate). Each survey includes 5 query-response pairs plus an extra pair for attention checking. We remove all pairs that were rated by workers who failed to correctly answer the attention-check tests. Each pair is annotated by 3 individual turkers. Table 2.2 shows three query-response pairs rated by three AMT workers. In total, 300 utterance pairs were rated through contributions of 106 unique workers.

2.7 Experimental Setup

2.7.1 Static Embeddings

To compare how the word embeddings affect the evaluation metric, which is the main focus of this chapter, we used word2vec embeddings trained on about 100 billion words of the Google News corpus as static embeddings. These 300-dimensional word embeddings cover almost 3 million words and phrases. We applied these pretrained embeddings as input to dialogue generation and to the referenced and unreferenced metrics.

2.7.2 Contextualized Embeddings

In order to explore the effects of contextualized embeddings on the evaluation metrics, we used the BERT base model with 768-dimensional vectors, pretrained on the Books Corpus and English Wikipedia with 3,300M words (Devlin et al., 2018).

2.7.3 Training the Unreferenced Model

We used the second part of the DailyDialog dataset, composed of 22,000/1,800/2,100 train/test/validation pairs, to train and tune the unreferenced model, which is implemented in TensorFlow. For the sentence encoder, we used 2 layers of bidirectional gated recurrent units (Bi-GRU) with 128-dimensional hidden units. We used a three-layer MLP with 256-, 512-, and 128-dimensional hidden units and tanh as the activation function for computing both the ranking loss and the cross-entropy loss. We used the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1e-4 and applied learning rate decay when no improvement was observed on the validation data for five consecutive epochs. We applied an early stopping mechanism and stopped the training process after observing 20 epochs with no reduction in the loss value.

2.8 Results

We first present the unreferenced metrics' performance. Then, we present results for the full RUBER framework, combining the unreferenced and referenced metrics. To evaluate the performance of our metrics, we calculated the Pearson and Spearman correlations between the learned metric scores and the human judgments on the 300 query-response pairs collected from AMT. The Pearson coefficient measures the linear correlation between two variables, while the Spearman coefficient measures any monotonic relationship. The third measure we used to evaluate our metrics is cosine similarity, which computes how similar the scores produced by the learned metrics are to the human scores.
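These three agreement measures can be computed as in the following sketch (using NumPy and SciPy; the score arrays are placeholders standing in for metric outputs and averaged AMT ratings):

# Sketch: compare metric scores with human judgments using the three measures above.
import numpy as np
from scipy.stats import pearsonr, spearmanr

metric_scores = np.array([0.9, 0.2, 0.6, 0.4])   # placeholder metric outputs
human_scores = np.array([4.7, 1.3, 3.0, 2.7])    # placeholder averaged AMT ratings

pearson_r, pearson_p = pearsonr(metric_scores, human_scores)
spearman_r, spearman_p = spearmanr(metric_scores, human_scores)
cosine = float(np.dot(metric_scores, human_scores) /
               (np.linalg.norm(metric_scores) * np.linalg.norm(human_scores)))

print(f"Pearson  {pearson_r:.2f} (p={pearson_p:.3f})")
print(f"Spearman {spearman_r:.2f} (p={spearman_p:.3f})")
print(f"Cosine similarity {cosine:.2f}")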
Embedding | Representation | Objective | Pearson (p-value) | Spearman (p-value) | Cosine Similarity
word2vec | Bi-RNN | Ranking | 0.28 (<6e-7) | 0.30 (<8e-8) | 0.56
word2vec | Bi-RNN | Cross-Entropy | 0.22 (<9e-5) | 0.25 (<9e-6) | 0.53
word2vec | Max Pooling | Ranking | 0.19 (<8e-4) | 0.18 (<1e-3) | 0.50
word2vec | Max Pooling | Cross-Entropy | 0.25 (<2e-5) | 0.25 (<2e-5) | 0.53
word2vec | Mean Pooling | Ranking | 0.16 (<5e-3) | 0.18 (<2e-3) | 0.50
word2vec | Mean Pooling | Cross-Entropy | 0.04 (<5e-1) | 0.06 (<3e-1) | 0.47
BERT | Bi-RNN | Ranking | 0.38 (<1e-2) | 0.31 (<4e-8) | 0.60
BERT | Bi-RNN | Cross-Entropy | 0.29 (<2e-7) | 0.24 (<3e-5) | 0.55
BERT | Max Pooling | Ranking | 0.41 (<1e-2) | 0.36 (<7e-9) | 0.65
BERT | Max Pooling | Cross-Entropy | 0.55 (<1e-2) | 0.45 (<1e-2) | 0.70
BERT | Mean Pooling | Ranking | 0.34 (<2e-9) | 0.27 (<2e-6) | 0.57
BERT | Mean Pooling | Cross-Entropy | 0.32 (<2e-8) | 0.29 (<5e-7) | 0.55

Table 2.3: Correlations and similarity values between relatedness scores predicted by different unreferenced models and human judgments. The first row is RUBER's unreferenced model.

2.8.1 Unreferenced Metrics Results

This section analyzes the performance of the unreferenced metrics, which are trained with various word embeddings, sentence representations, and objective functions. The results in the upper section of Table 2.3 are all based on word2vec embeddings, while the lower section is based on BERT embeddings. The first row of Table 2.3 corresponds to RUBER's unreferenced model, and the five following rows are our exploration of different unreferenced models based on word2vec embeddings, for a fair comparison with the BERT embedding-based ones.

Table 2.3 demonstrates that unreferenced metrics based on BERT embeddings have higher correlation and similarity with human scores. Contextualized embeddings have been found to carry richer information, and the inclusion of these vectors in the unreferenced metric generally leads to better performance (Liu et al., 2019). Comparing different sentence encoding strategies (Bi-RNN vs. pooling) while keeping the other variations constant, we observe that pooling of BERT embeddings yields better performance. This is likely because BERT embeddings are pretrained with deep bidirectional transformers, so pooling mechanisms are enough to assign rich representations to sentences. In contrast, the models based on word2vec embeddings benefit from a Bi-RNN-based sentence encoder. Across settings, max pooling always outperforms mean pooling. Regarding the choice of objective functions, the ranking loss generally performs better for models based on word2vec embeddings, while the best model with BERT embeddings is obtained by using the cross-entropy loss. We consider this an interesting observation and leave further investigation for future research.

Model | Unref. Embed. | Unref. Representation | Unref. Objective | Ref. Embed. | Pooling | Pearson | Spearman | Cosine Similarity
RUBER | word2vec | Bi-RNN | Ranking | word2vec | min | 0.08 (<0.16) | 0.06 (<0.28) | 0.51
RUBER | word2vec | Bi-RNN | Ranking | word2vec | max | 0.19 (<1e-3) | 0.23 (<4e-5) | 0.60
RUBER | word2vec | Bi-RNN | Ranking | word2vec | mean | 0.22 (<9e-5) | 0.21 (<3e-4) | 0.63
Ours | BERT | Max Pooling | Cross-Entropy | BERT | min | 0.05 (<0.43) | 0.09 (<0.13) | 0.52
Ours | BERT | Max Pooling | Cross-Entropy | BERT | max | 0.49 (<1e-2) | 0.44 (<1e-2) | 0.69
Ours | BERT | Max Pooling | Cross-Entropy | BERT | mean | 0.45 (<1e-2) | 0.34 (<1e-2) | 0.70

Table 2.4: Correlation and similarity values between automatic evaluation metrics (combinations of the referenced and unreferenced metrics) and human annotations for 300 query-response pairs annotated by AMT workers. The "Pooling" column shows the combination type of the referenced and unreferenced metrics.
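The "Pooling" column in Table 2.4 refers to how the unreferenced and referenced scores are combined into a single value. A minimal sketch of this combination step is shown below; the min-max normalization of each score before blending is an assumption made for illustration, not a detail specified in this chapter.

# Sketch: blend referenced and unreferenced scores with min / max / mean pooling.
import numpy as np

def min_max_normalize(scores):
    # Rescale a list of scores to [0, 1] (assumed here for illustration).
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

def blend(unref_scores, ref_scores, pooling="max"):
    u, r = min_max_normalize(unref_scores), min_max_normalize(ref_scores)
    stacked = np.stack([u, r])
    if pooling == "min":
        return stacked.min(axis=0)
    if pooling == "max":
        return stacked.max(axis=0)
    return stacked.mean(axis=0)                      # "mean" pooling

unref = [0.9, 0.1, 0.6]   # placeholder unreferenced (relevance) scores
ref = [0.7, 0.4, 0.2]     # placeholder referenced (similarity) scores
print(blend(unref, ref, pooling="max"))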
2.8.2 Unreferenced + Referenced Metrics Results

This section analyzes the performance of integrating variants of the unreferenced metrics into the full RUBER framework, which is the combination of the unreferenced and referenced metrics. We only considered the best unreferenced models from Table 2.3. As shown in Table 2.4, across different settings, the max combination of the referenced and unreferenced metrics yields the best performance. We see that metrics based on BERT embeddings have higher correlations with human scores than RUBER (the first row of Table 2.4), which is based on word2vec embeddings. In comparison with the purely unreferenced metrics (Table 2.3), correlations decreased across the board. This suggests that the addition of the referenced component is not beneficial, contradicting RUBER's findings (Tao et al., 2018a). We hypothesize that this could be due to data and/or language differences, and leave further investigation for future work.

2.9 Conclusion

We explored trainable automatic evaluation of relevance as a turn-level assessment feature. We applied contextualized word embeddings to the architecture of RUBER's referenced and unreferenced relevance evaluators to represent input query-response pairs. The experiments showed that the unreferenced scores of the RUBER metric improved and resulted in higher correlation with human annotations when using contextualized word embeddings, which provide richer representations of words and their context.

Chapter 3: Engagement Evaluation of Open-Domain Dialogue Systems

Relevance is one of the major features in open-domain dialogue systems, yet it is not the only aspect. In this chapter, we argue that relevancy by itself cannot capture all the characteristics of open-domain dialogue systems given their open-ended essence (Venkatesh et al., 2018; Guo et al., 2018; See et al., 2019a). For example, it is not informative to compare the output of the two dialogue systems depicted in Figure 3.1 based only on the relevancy of the generated responses, since both systems produce fairly relevant responses. Augmenting engagement scores would give a higher score to the first system, making the metric better aligned with expected user preferences.

While engagement is recognized as one of the most important metrics for open-domain dialogue evaluation (See et al., 2019a; Venkatesh et al., 2018; Guo et al., 2018), efficient calculation of this metric poses several significant challenges that we address in this chapter. First, existing works focus on conversation-level engagement only, while immediate evaluation and feedback for each utterance would be more effective in providing signals to adjust the system as it proceeds in a dialogue. Second, the existing methods for evaluating engagement are mostly heuristic, which can be inaccurate and usually brittle across different domains. Third, there is no systematic prior study on how well-defined and measurable utterance-level engagement is, and whether engagement measured from a single utterance can be predictive of conversation-level engagement.

In this chapter, we propose a new proxy for measuring engagement that we call predictive engagement, which, in contrast to most previous heuristic measures, operates at the utterance level and can be learned from data. We incorporate predictive engagement into the automatic relevance evaluation metric proposed in Chapter 2 to improve the correlation with human judgments.
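A minimal sketch of this combination is shown below, assuming the simple mean aggregation adopted later in this chapter; the two scorer functions are hypothetical stand-ins (returning fixed dummy values) for the contextualized RUBER relevance model of Chapter 2 and the utterance-level engagement classifier introduced below.

# Sketch: combine turn-level relevance and predictive engagement into one score.
def relevance_score(query: str, response: str) -> float:
    return 0.8   # dummy value standing in for the learned relevance model (Chapter 2)

def engagement_score(query: str, response: str) -> float:
    return 0.3   # dummy value standing in for the learned engagement classifier

def predictive_quality(query: str, response: str) -> float:
    # Mean aggregation of relevance and predictive engagement (one simple choice).
    return 0.5 * (relevance_score(query, response) + engagement_score(query, response))

print(predictive_quality("What is your favorite sport?", "Soccer"))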
3.1 EngagementMetrics Engagement is a substantial metric that shows user willingness to continue conversing with the system (Yu et al., 2004; Ma, 2018; Inoue et al., 2018) and has been studied in the context of dialogue systems (Yu et al., 2016; Zhang et al., 2018a; See et al., 2019a). Many researchers have considered engagement as a useful conversation-level metric. They estimate this aspect based on the number of turns, the total duration of conversation (Venkatesh et al., 2018; Khatri et al., 2018), ability to ask questions (Ghandeharioun et al., 2019) which are not adequate in defining the engagement feature. 13 R1: I love most kinds of sports but soccer is my favorite one. How often do you exercise? Q1:What is your favorite sport? Q2: Maybe once in a week. R2: Oh! you should spend more time on your health … Q1: What is your favorite sport? R1: Soccer Q2: Why? R2: Don’t know! Figure 3.1: An illustrative example of a user’s conversation with two chatbots. We anticipate that the user will prefer to converse with the top chatbot because the responses are both relevant and engaging, while the bottom chatbot generates relevant but not engaging responses. On the other hand, the turn-level of engagement evaluation can be more beneficial in providing immediate signals to train higher quality and more engaging dialogue systems. Therefore, in this chapter, we focus on learnable turn-level engagement evaluation metrics. 3.2 AnalysisofEngagementScores We propose to build a learnable model for turn-level engagement scores to evaluate open-domain dialogues (Ghazarian et al., 2020). Due to the shortage of training datasets containing turn-level engagement scores, we resort to heuristic-based techniques to extract turn-level engagement scores from existing conversation-level engagement dataset. In this section, we discuss turn- level and conversation-level engagement scores, investigate their connections, and introduce our proposed simple yet effective approach to get a turn-level training dataset. 3.2.1 Conversation-levelEngagementScores Engagement is defined as a user’s inclination to continue interacting with a dialogue system (Inoue et al., 2018; Ma, 2018). In many existing chatbot competitions like NeurIPS ConvAI 1 and Amazon Alexa prize 2 , users are asked to evaluate whole conversations based on how engaging and attractive they are in maintaining interaction. We define this as conversation-level engagement scores. We utilize the ConvAI dataset since it is publicly accessible. In human evaluation rounds of ConvAI competition, participants and volunteers conversed with a human or a chatbot via Telegram and Facebook messaging services, where their peers had been randomly assigned to them (Logacheva et al., 2018). From overall 4750 dialogues, the majority of conversations were between a human and a bot and 526 were human-to-human conversations. The interlocutors, participants and chatbots, rated utterances as well as conversations on different conversational aspects, where engagement was collected at the conversation-level in the range of 0 to 5 (0 as not engaging at all and 5 as extremely engaging). Engagement scores for human-to-human conversations were calculated by averaging user ratings, while for human-to-bot conversations, 1 http://convai.io/2017/data/ 2 https://developer.amazon.com/alexaprize 14 EngagementScores 0 1 2 3 4 5 Conversations 1690 21 81 47 147 63 Utterances 10122 45 238 444 1492 783 Table 3.1: Data statistics of the ConvAI evaluation dataset. 
The first row shows conversations with their corresponding engagement scores extracted from the original ConvAI dataset; the second row contains the number of utterances and their engagement scores automatically assigned by our heuristics.

only the human's opinion was used as a dialogue's engagement score. The first row in Table 3.1 demonstrates the distribution of the conversations with different engagement scores.

3.2.2 Utterance-level Engagement Scores

To explore the efficiency of incorporating engagement into existing successful automatic evaluation metrics measuring relevancy at the utterance level (Tao et al., 2018a; Ghazarian et al., 2019b), we need to investigate whether or not an engagement score can be measured at the utterance level. We propose to first study whether humans are capable of scoring engagement of a response for a given query without knowing any context or previous utterances. To achieve this, we executed experiments to check users' agreement level about engagement of each utterance.

We conducted Amazon Mechanical Turk (AMT) experiments on randomly selected 50 conversations from ConvAI, 25 human-to-human and 25 human-to-bot dialogues. Overall 297 utterance pairs have been extracted and rated by annotators in the same range (1-5) of engagement score in ConvAI. 49 workers participated in about 215 surveys, where each utterance pair has been annotated by 5 individual workers. We rejected users that did not pass attention-check tests in the surveys and reassigned their pairs to other workers. Eventually, as Table 3.2 demonstrates, the mean kappa (κ) agreement and mean Pearson correlation between evaluators participating in our experiments were 0.52 and 0.93. In the context of dialogue system evaluation where agreement is usually quite low (Venkatesh et al., 2018; Ghandeharioun et al., 2019; Yi et al., 2019), these numbers show relatively high agreement between annotators. This provides evidence that engagement can be measured not only at the conversation level but also at the utterance level.

Utterances | Annotators | Kappa Agreement | Pearson
297 | 49 | 0.52 | 0.93

Table 3.2: The results for the Amazon Mechanical Turk (AMT) experiments on utterance-level engagement. 49 annotators annotated 297 utterances and demonstrated quite high inter-annotator kappa agreement and Pearson correlation between annotations.

Aggregation Method | Pearson Correlation (p-value)
Min | 0.49 (<3e-4)
Max | 0.72 (<4e-9)
Mean | 0.85 (<9e-15)

Table 3.3: The Pearson correlation between engagement scores of 50 randomly selected conversations from ConvAI and the aggregated engagement scores of their utterances annotated by AMT workers with different aggregation methods.

3.2.3 Utterance-level and Conversation-level Engagement Scores

The high inter-annotator agreement on utterance-level engagement scores motivated us to study if conversation-level engagement scores can be converted into utterance-level ones. This can be beneficial as we can leverage them to train an utterance-level engagement scorer since there are no prior datasets for utterance-level engagement scores. The scorer can later be incorporated into existing automatic evaluation metrics. Hence, we ask the following research questions:

• Is there a high correlation between the aggregated utterance-level engagement scores and the conversation-level engagement score? For this purpose, we used the engagement scores annotated by AMT workers for 297 utterances of the ConvAI dataset, where each utterance's engagement score was the average of five individual annotators' ratings.
In order to calculate the intended correlation, we considered the engagement score of each conversation as the ground-truth and aggregated its utterances' engagement scores annotated by AMT workers to get the predicted conversation engagement score. Table 3.3 shows the computed correlations using different aggregation methods. The highest correlation is based on mean aggregation of utterance-level engagement scores, which is presented in the left scatterplot of Figure 3.2. Considering minimum or maximum aggregation of engagement scores for utterances as the conversation's overall score leads to lower correlation, since not all utterances of a good conversation are engaging, and likewise not all utterances of a bad conversation are boring.

• Is there a high correlation between utterance-level and conversation-level engagement scores assigned to all utterances in the conversation? In this part, we assigned the ConvAI conversation-level engagement scores to each of its utterances and then computed the Pearson correlation between these assigned scores and the scores from AMT workers. The computed Pearson correlation was 0.60, a relatively high correlation that is depicted in the right scatterplot of Figure 3.2.

Figure 3.2: The left scatterplot depicts the correlation between the ground-truth conversation-level engagement scores and the mean aggregation of engagement scores of utterances for the 50 conversations used in the AMT experiment. The Pearson correlation value is 0.85. The right scatterplot depicts the correlation between the engagement scores of 297 utterances annotated by humans in the AMT experiment and the heuristically assigned conversation-level engagement score given to all utterances in the conversation. The Pearson correlation value is 0.60.

There are cases where the difference between human ratings and assigned scores is clearly visible. Even though there are these mismatches, there is no publicly available dataset containing utterance-level engagement scores. The relatively high correlation between these scores enabled us to assign conversation-level scores to all utterances in the ConvAI dataset and use it for further experiments. The second row in Table 3.1 shows these utterances with their assigned engagement scores. As is shown in Table 3.1, the majority of utterances have zero engagement scores and the remaining are accumulated near labels 4 and 5. Therefore, we split the range of engagement scores from 1 to 5 into a binary range (considering all scores less than or equal to 2 as not engaging and greater than 2 as engaging); around 80 percent of the utterances are labeled as not engaging, and the remaining as engaging.

3.3 Engagement Classifier

As there is an absence of baseline models for automatically measuring utterance-level engagement scores, we consider one feature-based model and one neural-based model as baselines.

• The feature-based model is an SVM classifier with a predefined set of features including n-grams, length of each response and number of distinct words in each response.
• The neural-based model is a classifier with static word2vec embeddings as input and two Bidirectional Recurrent Neural Networks (Bi-RNNs) to map words embeddings into vector representations for both query and response, with a Multilayer Perceptron (MLP) classifier on top of the concatenated vector of each utterance pair. 17 Q: What is your major? R: I study math and you? 𝑣 "# 0 1 BERT BERT 𝑣 " 𝑣 $ 𝑣 % 𝑣 "& 𝑣 "' 𝑣 "( 𝑣 ") 𝑣 $# 𝑣 $& 𝑣 $' 𝑣 $( 𝑣 $) 𝑣 $* Pooling Pooling Pooling Figure 3.3: An illustration of the proposed utterance-level engagement classifier. • Our proposed engagement classifier is shown in Figure 3.3. It takes a pair of query and response as input and classifies it as engaging or not engaging. We choose to use BERT (De- vlin et al., 2018) embeddings as input to our model since Ghazarian et al. (2019b) showed superior results using BERT to evaluate the relevance of a response. The utterance vectors are computed by simply taking the max or mean pooling of their contextualized word embeddings. This works because these embeddings are computed by pretrained deep bidi- rectional transformers that already have information of the context. Ghazarian et al. (2019b) showed simple pooling strategy worked better than adding an additional BiLSTM layer when computing relevance. The utterance vectors of query and response pairs are then passed through an MLP classifier with cross entropy loss to classify the utterance as 0 (not engaging) or 1 (engaging). 3.4 ExperimentalSetupandResults In the experiments, we explore the efficacy of augmenting engagement scores inferred by our proposed model on open-domain dialog evaluation. We train our proposed model on the ConvAI dataset and then fine-tune it on the Daily Dialogue dataset that we use to evaluate the performance of automatic evaluation metrics. We augment predicted engagement scores with relevancy scores from baseline models and examine their correlation with human judgements. 3.4.1 BaselineModels In order to study the efficacy of combining engagement scores with existing evaluation metrics for dialogue systems evaluation, we use the unreferenced scores in RUBER (Tao et al., 2018a) and Contextualized RUBER (Ghazarian et al., 2019b) as the baseline metrics. In our experiments, we did not consider the referenced metric that measures the similarity between generated responses with references since Ghazarian et al. (2019b) showed that considering only the unreferenced scores yielded higher correlation with human judgements. The unreferenced score proposed 18 by Tao et al. (2018a) is computed by an MLP neural model which is trained with a ranking loss. This loss function maximizes the inferred score between positive and negative samples which are obtained from dataset and randomly matched query and response pairs, respectively. For the Contextualized RUBER baseline, we considered the best model proposed by Ghazarian et al. (2019b) which is an MLP classifier that takes contextualized embeddings as richer representation of words and uses cross-entropy loss function. 3.4.2 Datasets In order to explore the effect of engagement score on existing automatic evaluation metrics including RUBER and contextualized RUBER, we needed a dataset to train the proposed en- gagement classifier and a dataset to train an automatic dialogue evaluation metric to compare with the baselines. We used the ConvAI dataset for the first purpose since it has annotations for conversation-level engagement scores. 
We used the Daily Dialogue dataset to evaluate the efficiency of the utterance-level engagement scores for open-domain dialogue evaluation. 3.4.2.1 ConvAI To train the utterance-level engagement model, we used the engagement scores of conversations in ConvAI assigned to 13,124 utterance pairs as input data shown in the second row of Table 3.1. We split this dataset into 60/20/20 parts as train/validation/test sets. Table 3.4 shows these sets with the number of utterances labeled as 0 or 1. 3.4.2.2 DailyDialogueDataset The Daily Dialog dataset 3 is an open-source multi-turn open-domain dialogue dataset that includes daily conversations between humans on different topics. We used a part of this dataset including 22,000/1,800/2,100 pairs of train/test/validation sets for training the relevancy score of RUBER and contextualized RUBER as baselines models. In order to explore the effects of engagement scores on automatic evaluation metrics, we used the following datasets. In subsequent sections, we refer to each dataset based on specified names. • 300 utterances with generated replies: this is a human annotated dataset 4 about the quality of 300 utterance pairs randomly selected from the test set of the Daily Dialogue dataset released by Ghazarian et al. (2019b), where replies are generated based on an attention-based sequence-to-sequence dialogue model. • 300 utterances with human-written replies: Most replies in the above mentioned dataset are completely off-topic and do not make sense; therefore the engagement score will not add extra information about them. In order to have a fair assessment of successful dialogue systems that mainly include relevant responses, we repeated the experiments done 3 http://yanran.li/dailydialog 4 http://vnpeng.net/data/DailyDialog_annotated.zip 19 Engagement=0 Engagement=1 Train 6222 1562 Validation 2121 575 Test 2062 582 Table 3.4: ConvAI train/valid/test sets of utterances with their engagement score labels 0.70 0.58 0.72 0.59 0.76 0.64 0.74 0.61 Figure 3.4: Balanced accuracy (ROC AUC) of different utterance-level engagement classifiers on the ConvAI test set (the last row in Table 3.4) and human annotated test set (Table 3.2). The first two groups of bars show SVM and MLP classifier performance based on word2vec embeddings; the remaining bars are our proposed classifiers based on BERT embeddings with mean and max pooling strategies. by Ghazarian et al. (2019b) on the same 300 queries but with their ground-truth responses that mostly are relevant but not always engaging. We asked evaluators to judge each response’s overall quality in the range of 1 to 5 (low quality to high quality). Each pair is annotated by 3 individual workers; overall 24 annotators contributed in this experiment. 3.4.3 ImplementationDetails We trained our proposed model for utterance-level engagement score along with two baseline models on ConvAI dataset. Due to the imbalanced nature of this dataset, we used a weighted loss function for training purposes and balanced accuracy scores for evaluation. We trained the SVM classifier with a linear kernel function and 0.1 C parameter. Word2vec embeddings used in the neural baseline classifier are 300 dimensional embeddings trained on about 100 billion words of the Google News Corpus (Mikolov et al., 2013). The baseline neural model is a one layer MLP classifier with tanh as the activation function, a learning rate of 10 −5 and 0.8 dropout rate. 
Our proposed model uses BERT 768-dimensional vectors pre-trained on the Books Corpus and English Wikipedia as word embeddings (Devlin et al., 2018). The model is trained with a weighted cross-entropy loss function. The MLP classifiers are 3-layer networks with 64, 32 and 8 hidden units. The learning rate in the MLP classifier based on mean pooling of word embeddings is 1e-3, while with max pooling it is 1e-2. The performance of all trained models is demonstrated in Figure 3.4. The blue bars show the balanced accuracy of models on the ConvAI test set (Table 3.4), while the green bars show the balanced accuracy on the utterance pairs of the 50 conversations annotated by AMT workers as another test benchmark. According to the results in Figure 3.4, our proposed models based on BERT embeddings perform better in terms of accuracy, and they will be used for inferring engagement scores of utterances in the Daily Dialogue dataset.

Figure 3.5: Pearson and Spearman correlations between predictive engagement scores and human judgements for generated and human-written responses.

3.4.3.1 Transfer Learning

After training the utterance-level engagement classifiers, we fine-tuned them on a small set of utterance pairs randomly selected from the Daily Dialog dataset, excluding the pairs in the 300 utterances used for assessing automatic evaluation metrics. Indeed, the ConvAI dataset that the engagement models are trained on is the source domain, and the selected dataset for fine-tuning is the target domain. We recruited about 45 participants from AMT to annotate 300 pairs from the Daily Dialog dataset as engaging or not engaging. Around half of the selected pairs were Daily Dialogue queries and their ground-truth responses, which mostly are part of engaging conversations between two humans. The other half were queries and responses generated by the attention-based sequence-to-sequence dialogue system, which mostly were not engaging. We attained a mean kappa (κ) agreement of 0.51 between users that passed the attention-check tests attached to the AMT surveys.

3.4.4 Experimental Results

The performance of automatic evaluation metrics for open-domain dialogue systems is measured based on their correlation with human judgements. Higher correlations indicate these metrics can be a good substitute for human evaluations.

3.4.4.1 Quantitative Results

We inferred the engagement scores from the fine-tuned utterance-level engagement models for the 300 utterances with generated replies and aggregated them with the relevance scores obtained from the Contextualized RUBER model (Ghazarian et al., 2019a). We only included the mean aggregation of the relevance and engagement metrics, which resulted in the highest correlation in comparison with the other two aggregations (minimum and maximum) that we tried. Each part of Table 3.5 illustrates the correlations between human judgements and relevance, engagement, and the combination of these two metrics, respectively. As is shown in the first part of Table 3.5, the correlations between human judgements and the two evaluation metrics are very close to the baseline metrics that only compute relevance scores. The many off-topic replies generated by the attention-based sequence-to-sequence dialogue system could be the reason for this observation. According to the second part of Table 3.5, the Pearson and Spearman correlations between human judgments and the relevance scores for the 300 utterances with human-written replies are low.
Incorporating engagement scores leads to higher correlations with human judgements. Indeed, the baseline models score the majority of human-written responses very high, while users consider other aspects such as engagement for giving the utterance an overall quality score. Figure 3.5 more clearly depicts that the correlation between human judgements and engagement- only scores on the 300 utterances with generated replies is low. This is probably because the annotators do not pay attention to other aspects like engagement for evaluating a response that is not relevant to a given query. Figure 3.5 also illustrates the positive effect of considering engagement scores in evaluating the human-written responses. We combined the two sets from the Daily Dialogue dataset, and the last part in Table 3.5 shows the correlations on the combined 600 query-reply pairs. The higher correlations between human annotations with relevance and engagement scores illustrate the success of applying engagement as an extra score to baseline metrics in order to have a better automatic evaluation system. 3.4.4.2 SignificanceTest To see whether the improvements of the correlation is statistically significant, we applied hy- pothesis testing to compare the dependant correlations with overlapping variables; in our case the human judgements (Diedenhofen and Musch, 2015). According to hypothesis testing, the probability of the null hypothesis, which states that two correlations are equal or not that much different is ≤ 0.05; thus, the improvement is significant. 3.4.4.3 CaseStudy Some real examples from the Daily Dialogue dataset are shown in Table 3.6, which demonstrates the positive influence of aggregating engagement score with relevance score in order to have much closer evaluations to human judgements. 22 3.5 Conclusion We hypothesized that it is not adequate to compare open-domain dialogue systems solely based on the relevance of the responses. An utterance-level engagement measurement can be built to improve the automatic open-domain dialog evaluation metrics and consequently used for better training dialogue models. To this end, we verified the feasibility of measuring utterance-level engagement and showed a high correlation between the utterance-level and conversation-level engagement scores. We incorporated the utterance-level engagement scores inferred by our proposed model into other relevance-based evaluation metrics and showed an improved correlation to human judgments. 23 Dataset Metric Pearson Spearman 300 Generated Responses RUBER_relavance 0.28 0.30 Ctx_RUBER_relevance 0.55 0.45 MLP BERT(mean) 0.06 0.16 MLP BERT(max) 0.09 0.03 MLP BERT(mean) + Ctx_RUBER_relevance 0.48 0.48 MLP BERT(max) + Ctx_RUBER_relevance 0.52 0.44 300 Human- written Responses RUBER_relavance 0.04 0.02 Ctx_RUBER_relevance 0.14 0.12 MLP BERT(mean) 0.46 0.52 MLP BERT(max) 0.53 0.53 MLP BERT(mean) + Ctx_RUBER_relevance 0.36 0.39 MLP BERT(max) + Ctx_RUBER_relevance 0.32 0.39 600 Generated and Human-written Responses RUBER_relavance 0.24 0.30 Ctx_RUBER_relevance 0.54 0.55 MLP BERT(mean) 0.39 0.45 MLP BERT(max) 0.35 0.36 MLP BERT(mean) + Ctx_RUBER_relevance 0.61 0.64 MLP BERT(max) + Ctx_RUBER_relevance 0.60 0.62 Table 3.5: Pearson and Spearman correlations between human judgements and several automatic dialogue evaluation metrics on generated responses, human-written responses, and their mixture. 
We adopt the mean aggregation of the relevance score of contextualized RUBER mentioned as Ctx_RUBER and the predictive utterance-level engagement scores. The first two rows in each group show correlations between human judgements and baseline models with only relevance scores, the middle two rows are for only engagement scores and the last two rows add engagement scores into relevance scores. Boldface indicates the improvements are significant compared to the baseline in the corresponding group (p<.05). 24 Query Response Human Rel. Eng. Rel. + Eng. OK. What’s the rea- son you are sending her flowers? Today’s her birth- day and she told me she wants me to buy her flowers. 0.92 0.99 0.88 0.94 The kitchen may be large, but it doesn’t have any storage space. The master suite is supposed to be quite elegant. Maybe it will be a little better. 0.75 0.65 0.94 0.80 Not long, because people rush for lunch. The line sure does move fast. 0.33 0.82 0.11 0.47 That’s a good idea. And remind them to be slow at the be- ginning, not to run into the railings. OK. Anything else? 0.42 0.84 0.14 0.49 Table 3.6: Examples of evaluation scores for utterances from the Daily Dialogue Dataset. We used unreferenced score of Contextualized RUBER as relevance score and MLP BERT(mean) as engagement score, where the numbers are rounded into 2 digits. The incorporation of engagement scores into relevance scores yields scores closer to human judgements – the main goal of automatic evaluation metrics. 25 Chapter4 CoherenceEvaluationofOpen-DomainDialogueSystems Thus far, we have talked about relevance and engagement as two main turn-level evaluation metrics. We employed simple heuristics such as random matching and score assignment to construct train datasets. Since the quality of these metrics is strongly dependent on the quality of training data, we show such heuristic approaches are insufficient to appropriately reflect the issues that occur in the interactions between advanced dialogue models and humans. We investigate more advanced semantic-level manipulations to better replicate issues by state-of-the-art dialogue systems. We propose an automatic evaluation metric that is able to evaluate the overall coherence of the conversations by being trained on samples resulted from semantic-level perturbations applied to the positive and coherent human-human conversations. To this end, we propose the injection of various types of incoherence sources, such as coreference inconsistency, irrelevancy, contradictions, and decrease engagement, to theAbstractMeaningRepresentation(AMR) of dialogues which shows the conversation in more abstract-level. 4.1 DifferentPerturbationsforCoherenceEvaluation Most trainable automatic evaluation metrics focus on turn-level interactions, where they learn to assess the quality of one user-system utterance pair (Tao et al., 2018b; Huang et al., 2020a; Ghazarian et al., 2020). However, these metrics cannot appropriately model the whole conversation flow (Yeh et al., 2021), and thus are insufficient for dialogue-level evaluation. In this chapter, we focus on the automatic evaluation of the coherence of dialogues, which is under-explored. Coherence is a conversation-level metric that measures how well the utterances in a conversation are unified leading to a consistent interaction (Byron and Stent, 1998; Mesgar et al., 2020). 
Previous works pursue different models such as graph-based (Vakulenko et al., 2018; Zhang et al., 2021) or text-based (Mesgar et al., 2020) approaches to develop automatic trainable coherence evaluation metrics. Those models take acontrastivelearning approach, where they build binary classifiers to differentiate positive or coherent examples from negative or incoherent dialogues. Those classifiers are usually trained on datasets constructed using human-human conversations as positive examples and applyingtext-level heuristic manipulations to generate incoherent con- versations. The text-levelmanipulations directly change the structures of the conversation, such 26 Not everyone is into this type. Have you seen Love, Death and Robots? Did you like it? Yes, we watched in three hours. Do you like science fantasy? Yes, I watched it with my friend. Yes, I saw one in a factory a few weeks ago. Not that much. Have you seen Love, Death and Robots? one episode? I asked you a question What was the question? Which is your favourite episode? Have you seen Love, Death and Robots? Did you like it? She lives in NYC. I do know its director. He should have a very creative mind. They spend many years on it. Nice to chatting with you! Have you work for Love, Death and Robots? Did you like me? Not that much. No job is about this type. Yes, we watched in three hours. I ignored to watch it. Yes, I watched it with my friend. Figure 4.1: Examples of human-human (first dialogue) and human-chatbot (second dialogue) conversations alongside manipulations resulted from baseline (indicated with yellow color) and our proposed perturbations (indicated with purple color), respectively. Similar to the human- chatbot interaction, our manipulations result in more subtly-incoherent dialogue compared to baseline manipulations. as shuffling the order of utterances, replacing some random utterances from external conversa- tions (Vakulenko et al., 2018; Mesgar et al., 2020; Zhang et al., 2021), as shown in the third dialogue of Figure 4.1. We posit that such text-level manipulations are too simplistic to adequately represent more nuanced coherence errors presented in the current state-of-the-art dialogue systems. For ex- ample, the second conversation in Figure 4.1 shows a human-system interaction from the FED dataset (Mehri and Eskénazi, 2020), where the incoherence is much more subtle than the ones created by text-level manipulations. we investigate manipulation techniques to generate negative samples that represent coherence errors more likely to happen in the state-of-the-art dialogue systems. To this end, we propose DEAM 1 , a model that uses Abstract Meaning Representation (AMR) to apply semantic-level manipulations to generate negative examples (Ghazarian et al., 2022b). AMRs are intended to capture the meaning of a sentence by abstracting away irrelevant syntactic features. Thus, injecting targeted and controlled perturbations into an AMR is easy and can introduce semantic incoherence into the corresponding sentences. 1 Our proposed manipulations, data, and trained models can be found athttps://github.com/PlusLabNLP/DEAM 27 … Text-to-AMR AMR-to-Text RoBERTa Ø We could take the bus there. Ø It's too crowded Ø Another bus came here. Ø Fine, let's get on. Oh no, get off the bus quickly. Ø The bus can run for the bus there. Ø I am too crowded Ø Another bus came here. Ø Fine, let's get on. Oh no, get off the bus quickly. the bus can’t run the bus. 
… … … AMR-based Manipulations Figure 4.2: Overall architecture of DEAM metric trained on positive (green box) interactions and negative (red box) conversations generated from AMR-based manipulations (orange box) 4.2 DEAMOverview Our goal is to build an evaluation metric that measures the conversation-level coherence of dialogues. We follow the trainable evaluation metrics (Vakulenko et al., 2018) to formulate the evaluation as a classification task. We train the evaluator on positive (coherent) and negative (incoherent) conversations, and take the predicted probability for the positive class as the coherence score. As is discussed above, the main challenge for building a reliable metric is to obtain negative samples that can adequately represent the incoherence issues presented in advanced dialogue systems. To this end, we propose to generate negative examples by leveraging AMR-based manipulations. We then build a RoBERTa-based classifier as the evaluation metric by fine-tuning RoBERTa on the automatically generated training data. Figure 4.2 illustrates an overview of our proposed evaluation method. The first step of DEAM is to apply Text-to-AMR models to the conversations. Text-to-AMR or AMR parsing (Jin and Gildea, 2019; Xu et al., 2020; Zhou et al., 2021a; Lam et al., 2021) that translates conversation texts to directed and acyclic AMR graphs containing relation edges between concept nodes (Banarescu et al., 2013) has been effectively accomplished by transformer-based models in a sequence-to-sequence (seq2seq) training fashion (Xu et al., 2020; Zhou et al., 2021a). We use the fine-tuned T5 (Raffel et al., 2020a) model 2 for this purpose. We then manipulate the AMR graphs (section 4.3.2) and back translate them into conversation texts to be used as negative examples for training the text-based coherence evaluator. Similar to AMR parsing, we use finetuned T5 3 that is shown to be effective for the AMR-to-Text generation task (Mager et al., 2020; Ribeiro et al., 2020). 2 We leverage the released parse_t5 model fromhttps://github.com/bjascob/amrlib 3 We leverage the released generate_t5wtense model fromhttps://github.com/bjascob/amrlib. 28 I read a few of his plays when I was in school. How about you? Figure 4.3: AMR representation of a dialogue utterance 4.3 IncoherentDialogueGeneration The challenge that automatic trainable evaluation metrics face is in providing training data that can appropriately replicate moderate to low quality conversations with incoherence sources that usually happen in the current dialogue models. The common solution is to apply manipulations to positive conversations. In this section, we summarize the baselines manipulations and state our proposed AMR-based perturbations. 4.3.1 BaselinesManipulations Baseline manipulations can be classified as: 1) Shuffling-based manipulations: In such manipulations, turns order (Vakulenko et al., 2018), sequence of speakers utterances (Mesgar et al., 2020; Vakulenko et al., 2018; Zhang et al., 2021), or the position of the first and second sections of conversations (Vakulenko et al., 2018) are swapped. 2) Insertion-based manipulations: This group of manipulations add incoherence sources by replacing (Mesgar et al., 2020; Zhang et al., 2021) or inserting (Mesgar et al., 2020) a random utter- ance from a randomly selected conversation. Each baseline metric fuses multiple manipulations, hence we use their citations (Vakulenko et al., 2018), (Mesgar et al., 2020) to easily refer them in later sections. 
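Before describing the individual perturbations, the sketch below illustrates the round trip of Figure 4.2 in code. It assumes the amrlib library with its T5-based parsing and generation models installed locally (the same family of models referenced in the footnotes above); negate_random_concept is a deliberately crude, hypothetical placeholder for the manipulations defined in the next section, which in practice would operate on a properly parsed graph (e.g., via the penman library) and use ConceptNet relations.

```python
# Rough sketch of the text -> AMR -> manipulate -> text round trip,
# assuming the amrlib parse/generate models are installed.
import random
import amrlib

stog = amrlib.load_stog_model()   # Text-to-AMR (parsing)
gtos = amrlib.load_gtos_model()   # AMR-to-Text (generation)

def negate_random_concept(graph: str) -> str:
    """Toy contradiction edit: attach ':polarity -' to one concept line of the
    linearized AMR. Real manipulations would pick targets more carefully."""
    lines = graph.splitlines()
    candidates = [i for i, line in enumerate(lines)
                  if line.strip() and not line.startswith("#")]  # skip metadata lines
    idx = random.choice(candidates)
    stripped = lines[idx].rstrip(")")
    closing = lines[idx][len(stripped):]          # keep the closing parentheses intact
    lines[idx] = stripped + " :polarity -" + closing
    return "\n".join(lines)

def make_incoherent(utterances):
    graphs = stog.parse_sents(utterances)          # one AMR graph string per utterance
    perturbed = [negate_random_concept(g) for g in graphs]
    sents, _ = gtos.generate(perturbed)            # back-translate perturbed AMRs to text
    return sents

negative = make_incoherent(["We could take the bus there.", "It's too crowded."])
```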
4.3.2 AMR-basedManipulations AMR is originally proposed by Banarescu et al. (2013) as a semantic representation language that helps to abstract away the text from surface syntactic. Many abstract-level semantic information such as named entities, negations, questions, coreferences and modalities in the texts can be encoded by AMR graphs. These potential capabilities of AMR make it lucrative in many semantic- related NLP tasks such as summarization (Liao et al., 2018) and machine translation (Song et al., 2019). Conversations between two interlocutors contain many semantic details that can be captured by these graphs. Therefore, we explore AMR features’ usage in the dialogue systems evaluation 29 task by manipulating the AMR graphs of coherent conversations, each manipulation reflecting a specific reason for incoherence in dialogue systems. Figure 4.3 demonstrates a linearized version of an utterance AMR graph. In AMR graphs, entities and concepts are shown as nodes and their relations are depicted with various relation edges (Banarescu et al., 2013). Each AMR concept is either a word, or a PropBank framesets keyword (Kingsbury and Palmer, 2002). The PropBank framesets with their predefined arguments are used to abstract away concepts from syntactic structures. As an example,located concept of PropBank framesets in Figure 4.3 comes with two arguments the subject (i) and the place (school). In DEAM, we pursue the idea of manipulating abstract-level semantic information extracted from AMRs to generate incoherent conversations. In this work, we mainly focus on four major logical flaws that could happen in state-of-the-art dialogue models such as cases when a chatbot contradicts its previously stated utterances, uses incorrect coreferences, responds users with irrelevant utterances, does not engage enough in the conversation. We explain each of these logical flaws in detail. 4.3.2.1 Contradiction One of the common issues that dialogue systems struggle with is directly or indirectly contradicting previous utterances in dialogue. To replicate this type of error, a contradicted version of a subgraph from the original AMR is copied to other locations. This negative form AMRs can be accomplished by directly addingpolarity to the concepts or replacing concepts with their antonyms that hold Antonym, NotDesires, NotCapableOf, and NotHasProperty relations in ConceptNet (Speer and Havasi, 2012). After adding contradictions, the AMR-to-Text model will use the encoded context to output incoherent yet natural conversations. In the bottom right example of Figure 4.4, speaker B contradicts its previously stated opinion that badly effects the linkage of the utterances. 4.3.2.2 CoreferenceInconsistency The coherence of a conversation is preserved by the correct references of previously mentioned entities and words in the dialogue context. Pronouns in the conversation play an essential role in this regard. Coreferences in AMRs are presented as arguments (ARG) and all three different types of pronouns such as subjective, objective and possessive pronouns are shown in their subjective format. To disrupt the coreferences relations, we randomly replace some pronouns in the conversation’s AMR with another pronoun or noun identified as ARG or operand (op) from the same conversation. After replacements, the AMR-to-Text model adapts other sections of the utterance accordingly and reassures us that outputs have natural look and correct grammar. 
The third utterance in Figure 4.4 demonstrates an example of coherence inconsistency which makes the utterance to be not logical. 30 A1: Have you watched Sesame Street? B1: I used to when my kids were young. I liked Oscar the Grouch. He seemed realistic. A2: He was one of my favorite character as well, why is he green though? I’ve always wondered that. B2: He was once orange though. [Original] …… Irrelevancy [After] A1: You listen to Sesame Street? (w / watch-01 :ARG0 (y / you) :ARG1 (b / broadcast-program :name (n / name :op1 "Sesame" :op2 "Street")) :polarity (a / amr-unknown)) [Original] (w / listen-01 :ARG0 (y / you) :ARG1 (b / broadcast-program … AMR Mnplt. AMR Mnplt. Co-reference Inconsistency [After] A2: They are among my favorite characters as well. (Question removed) I’ve always wondered that. (m / multi-sentence :snt1 (ii / include-91 :ARG1 (h / they) :ARG2 (c / character … (m / multi-sentence :snt1 (ii / include-91 :ARG1 (h / he) :ARG2 (c / character :ARG1-of (f / favor-01 :ARG0 (ii2 / i))) :mod (a / as-well)) [Original] Decreased Engagement [After] A2: They are among my favorite characters as well. (Question removed) I’ve always wondered that. (m / multi-sentence :snt1 (ii / include-91 … original :snt2 removed. :snt2 (w / wonder-01 … AMR Mnplt. (m / multi-sentence :snt1 (ii / include-91 … :snt2 (h2 / have-concession-91 :ARG1 (g / green-02 :ARG1 (h3 / he) :ARG1-of (c2 / cause-01 :ARG0 (a2 / amr-unknown)))) :snt3 (w / wonder-01 … [Original] Contradiction [After] B2: He was orange once though, I used to be when my kids were young. I hate Oscar the Grouch, he doesn't seem realistic. … :snt3 (h / hate-01 :ARG0 (ii2 / i) … AMR Mnplt. (Copy Negate Insert) … :snt3 (l / like-01 :ARG0 (ii2 / i) :ARG1 (p2 / person :name (n / name … [Original] Figure 4.4: An abbreviated sample conversation to illustrate four different AMR-based DEAM manipulations 4.3.2.3 Irrelevancy Random utterance substitution from other conversations is a simple way to inject incoherence sources in dialogues, which has been frequently used in prior work (Tao et al., 2018b; Ghazarian et al., 2019c; Mesgar et al., 2020; Zhang et al., 2021). Conversations with completely off-topic utterances are rarely generated by advanced dialogue models due to their ability in encoding dialogue history for continuing the conversation. We propose to apply irrelevancy sources to AMR graphs. We select some AMR items such as concepts, ops, ARGs and replace them with random items from other utterances. In this approach, the replacement items are not from randomly selected conversations but still, they do not fit well in their newly selected locations which hurts the coherence of the conversation. In Figure 4.4 watch is replaced withlisten. The benefits of using AMR-to-Text model emerge here where some new adaptations (such as to) have been augmented with new verb replacement to give the utterance a fluent look. 31 4.3.2.4 DecreaseEngagement In coherent conversations, speakers exchange opinions about different topics by stating detailed information, asking and answering questions. This coherence will be faded if one of the inter- locutors evades to answer questions or talk in detail. In contrast to previous works that ignored this important feature, we augment such kind of incoherence sources into the negative sampling generation. In order to decrease the engagement of coherent conversations, we take the advantage of AMRs which are able to demonstrate detailed utterances and those containing questions. 
In AMR graphs, detailed utterances include more number of nested layers and concepts, ARGs and ops. Question-type utterances can be easily distinguished via amr-unknown concept notation and therefore with relying on AMRs the goal of decreasing engagement in the conversation is easily achievable. We propose three different approaches to decrease the engagement and consequently the coherence of the conversations: 1) Remove question-type utterances in the conversation: we select a multi-sentence utterance including amr-unknown concept and remove it and all its children nodes from the graph; 2) Remove the most detailed utterance in the conversation: the utterance having the largest depth in the graph is selected as the utterance with the most transferred information and all its children alongside its parent concept are removed from the graph; 3) Remove fine-grained information in the utterances: the main concepts’ detailed information that are presented asARG orop in the AMRs are randomly selected and eliminated from the graph. The higher-level concepts in the graph are preserved while its lower-level child nodes are deleted which makes the utterance not transfer its meaning and diminishes the linkage of topics. The question part in the third utterance of Figure 4.4 has been removed causing the coming utterances to not be completely sensible. 4.4 ExperimentalSetup We compare DEAM and its negative example generation techniques with baseline models and manipulations. We aim to have a data-driven analysis under three setups: Setup1): In this setup, we compare DEAM with baseline models by varying both the data manipulation strategies and the classification models. We fix the positive examples to be the same set of human-human conversations. Setup2): Since baseline models are trained on different datasets, we conduct pairwise compar- isons between DEAM and each baseline evaluator by training on the baseline’s dataset. Note that we only take the positive examples from the baseline’s dataset, and apply different manipulations to get negative examples to compose a balanced set for training. We also train different classifiers (DEAM vs. baselines) for the evaluators. Setup 3): This setup is designed to show the effectiveness of different manipulations to generate negative examples. We fix the positive examples and the classifier (i.e., RoBERTa). 32 Dataset size conv. len utt. len Topical_DEAM 17.3k/2.2k 530/530 24/24 Persona_DEAM 17.9k/2.0k 187/202 13/13 FED 125 168 11 DSTC9 2.2K 318 11 Table 4.1: Statistics (size, average length of conversations and utterances) of TopicalChat and Per- sonaChat train/valid datasets (augmented with AMR-based manipulated conversations), alongside with FED and DSTC9 test datasets. 4.4.1 Datasets 4.4.1.1 TrainingDatasets We conduct our experiments on two crowd-sourced datasets, TopicalChat (Gopalakrishnan et al., 2019a) and PersonaChat (Zhang et al., 2018b). Both datasets are composed of conversations between Amazon Mechanical Turk (AMT) participants. In TopicalChat, AMT workers were supposed to have coherent and engaging conversations regarding the provided reading sets about different topics, while in the PersonaChat dataset coherent conversations were conditioned on the provided 1155 personas each including 5 personality description sentences collected via AMT. We take these conversations as coherent conversations. We follow DEAM’s steps to generate and add a balanced number of incoherent conversations. 
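As a small illustration of how such a balanced set could be assembled, the sketch below keeps each human-human conversation as a positive example and pairs it with one perturbed negative; the manipulate stub stands in for the parse-perturb-generate round trip of Figure 4.2, and sampling one to three perturbation types per conversation mirrors the ablation setup described later, so the exact counts here are assumptions for illustration.

```python
# Sketch: build a balanced coherent/incoherent training set.
import random

MANIPULATIONS = ["contradiction", "coreference_inconsistency",
                 "irrelevancy", "decrease_engagement"]

def manipulate(conversation, kinds):
    """Hypothetical wrapper around the Text-to-AMR / AMR-to-Text pipeline."""
    raise NotImplementedError

def build_training_set(coherent_conversations):
    examples = []
    for conv in coherent_conversations:
        examples.append((conv, 1))                                   # coherent positive
        kinds = random.sample(MANIPULATIONS, k=random.randint(1, 3)) # a few perturbations
        examples.append((manipulate(conv, kinds), 0))                # incoherent negative
    random.shuffle(examples)
    return examples
```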
Table 4.1 shows the train/valid statistics of the newly constructed datasets calledTopical_DEAM andPersona_DEAM. 4.4.1.2 EvaluationDatasets In the literature of automatic evaluation metrics, the prevalent way of assessing the evaluators’ performance is to compare the correlation of their predicted scores with human judgments. FED (Mehri and Eskénazi, 2020) and Interactive Evaluation of Dialog track of the Dialog State Tracking Challenge 9 (DSTC9) (Gunasekara et al., 2020) are two publically available benchmark datasets including human ratings on the coherence aspect of human-human or human-systems conversations. The participants of FED dataset, have judged 125 conversations; 41 human-human, 44 human- Mitsuku chatbot and 40 human-Meena (Adiwardana et al., 2020) chatbot. Humans have assessed the conversations from 11 conversation-level evaluation aspects including the coherence and overall scores. Each conversation in FED is judged by 5 distinct annotators. Coherence and overall scores are in the range of 0-2 and 0-4, respectively. In DSTC9 dataset 4 , AMT workers have rated 2200 conversations between invited participants and knowledge-grounded response generation models using the same 11 fine-grained conversation- level evaluation aspects. Coherence and overall scores that we use in our experiments are in the range of 1-3 and 1-5, respectively. In our experiments, we take the average of judgments for 4 https://github.com/exe1023/DialEvalMetrics 33 conversations with more than one annotator’s ratings and compute the Spearman correlations between human evaluations and evaluator’s generated scores. 4.4.2 ImplementationDetails In our work, we train and run all the models on a machine with a GeForce RTX 2080 Ti GPU. We fine-tune RoBERTa-large pretrained model on Topical_DEAM andPersona_DEAM datasets for three epochs and optimize parameters using Adam optimizer with 1e-5 learning rate. To conduct experiments in setup 1, we train Vakulenko et al. (2018)’s graph-based model for 128 epochs with 1e-5 learning rate. Mesgar et al. (2020)’s LSTM-based model is trained for 8 epochs with 5e-5 learning rate. We retrain DynaEval (Zhang et al., 2021) for 20 epochs. All baselines are trained using Adam optimizer. Due to not publically published models proposed by Vakulenko et al. (2018) and Mesgar et al. (2020), we need to retrain these models on their original datasets; Ubuntu (Lowe et al., 2015) and DailyDialog (Li et al., 2017b); using the same hyperparameters published in the aforementioned papers to complete experiments in setup 2. We use DynaEval’s published checkpoints to run experiments in this setup. In experimental setup 3, we start fromTopicalChat andPersonaChat datasets, and augment negative samples pursuing different manipulation techniques. We fix the evaluator and finetune RoBERTa-large model for 3 epochs with a 1e-5 learning rate. Since Vakulenko et al. (2018)’s proposed manipulations are in the entity level, we adapt the perturbations to the text level by replacing the sequence of entities with a sequence of utterances substitutions to be acceptable by the RoBERTa model. 4.5 Results Through our experiments, we report the Spearman correlation of evaluation metrics with human annotations under different experimental setups. 4.5.1 MetricsPerformance Table 4.2 depicts the quantitative results for different evaluation models on both FED and DSTC9 datasets of experimental setup 1. 
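For concreteness, a condensed sketch of the coherence-evaluator fine-tuning step is given below, assuming the HuggingFace transformers Trainer API. The epoch count and learning rate mirror the values reported above, while the batch size, truncation length, and toy training examples are assumptions made only for illustration.

```python
# Sketch: fine-tune RoBERTa-large as a binary coherence classifier.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

# Toy data: each conversation is flattened into one string of utterances.
train_convs = ["A: Have you seen the film? </UTT> B: Yes, I loved it.",
               "A: Have you seen the film? </UTT> B: I hate my job."]
train_labels = [1, 0]   # 1 = coherent, 0 = incoherent

class CoherenceDataset(torch.utils.data.Dataset):
    def __init__(self, conversations, labels):
        self.enc = tokenizer(conversations, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="deam_checkpoint", num_train_epochs=3,
                         learning_rate=1e-5, per_device_train_batch_size=4)
trainer = Trainer(model=model, args=args,
                  train_dataset=CoherenceDataset(train_convs, train_labels))
trainer.train()
# At inference time, the softmax probability of the coherent class is the coherence score.
```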
According to the reported correlations, the superiority of DEAM shown in the last row versus other baselines is obviously recognizable. This superiority could originate from the subtle negative sampling technique. In experimental setup 1, manipulation techniques and models vary between evaluators, there- fore we complete our investigation via experimental setup 2 by conducting one by one comparison of DEAM with each baseline model training each pair on the same dataset that the baseline model has been trained on. Table 4.4 shows the output of this type of pairwise comparisons separated each pair into one section. Even though most of the baseline models correlations increased, yet DEAM takes the lead. It is noteworthy that the correlation of DynaEval reported in the original 34 Model FED DSTC9 Coh Ovrl. Coh Ovrl. Mesgar et al. (2020) 0.10 -0.01 0.02 0.05 Vakulenko et al. (2018) 0.13 0.10 -0.001 -9.6e-5 DynaEval -0.36 -0.4 -0.03 -0.01 DEAM 0.47 0.55 0.19 0.20 Table 4.2: Spearman Correlations of different models with human judgements trained on Top- icalChat and PersonaChat datasets following different manipulations for negative sample generation (setup 1). Manipulation FED DSTC9 Coh Ovrl. Coh Ovrl. Mesgar et al. (2020) 0.29 0.24 0.15 0.14 Vakulenko et al. (2018) 0.29 0.20 0.15 0.14 DynaEval 0.32 0.25 0.14 0.15 DEAM 0.47 0.55 0.19 0.20 Table 4.3: Spearman Correlations of the same RoBERTa-large models finetuned on TopicalChat and PersonaChat datasets augmented with incoherent conversations generated by different manipulation techniques (setup 3). paper for FED dataset has decreased, this could be due to the less number of negative samples that we consider for each positive conversation and its major impact on this model’s performance. 4.5.2 ManipulationsEffect Table 4.3 illustrates the results of experimental setup 3, where we fix both RoBERTa evaluator and TopicalChat andPersonaChat original datasets and apply different manipulation techniques to add negative samples. Even though the correlation for baseline manipulations increased drastically, which shows the effectiveness of strong pretrained language models in better encoding conversations information used for the evaluation task, DEAM’s performance is still higher. This interprets the beneficial effect of AMR-based manipulations. The positive slops of the regression line in Figures 4.5 and 4.6 between DEAM predicted coherence scores and human coherence and overall evaluations for FED dataset show the proposed manipulations superiority from a different 35 Model Manipulations Dataset FED DSTC9 Coherence Overall Coherence Overall Vakulenko et al. (2018) Vakulenko et al. (2018) Ubuntu 0.17 0.15 -0.04 -0.1 DEAM DEAM Ubuntu 0.18 0.25 0.16 0.15 Mesgar et al. (2020) Mesgar et al. (2020) DailyDialog -0.36 -0.47 0.13 0.14 DEAM DEAM DailyDialog 0.34 0.36 0.17 0.18 DynaEval DynaEval empathetic 0.17 0.10 -0.01 -0.02 DEAM DEAM empathetic 0.48 0.47 0.20 0.20 Table 4.4: Pairwise comparisons between DEAM with the proposed AMR-based manipulations and different baseline models using their original datasets, manipulations, and models (setup 2). All models have been trained on balanced sets of coherent/incoherent examples. angle. The distribution of baseline models predicted low scores for high-quality conversations and vice versa present their ineffectiveness in correctly distinguishing between low-quality and high-quality conversations. 4.5.3 AblationStudies Next, we inspect the role of each of the four proposed manipulations in the metric’s performance. 
We conduct an ablation study onTopicalChat andPersonaChat datasets to assess the effective- ness of each manipulation. For each manipulation, we remove it from the list of manipulations and try to randomly sample one up to three manipulations to create negative samples. In Table 4.5, we witness an overall drop by eliminating each of the manipulations that indicates the positive impact of all of the manipulations on generating higher quality negative samples that are closer to the samples generated by state-of-the-art models and consequently the evaluator’s accuracy. Removing irrelevancy and decrease engagement manipulations have the most detri- mental impact on the metric, which suggests that many state-of-the-art models struggle with such issues. By eliminating these manipulations the model does not have access to such negative examples during training, which significantly limits its ability to detect such incoherences during inference time. On the other hand, omitting coreference inconsistency from the manipulations has the lowest impact on DEAM, specifically for DSTC9 dataset which can be interpreted as the state-of-the-art models are safer regarding such issues. We also note that the performance difference between DSTC9 and FED could be due to the long conversations in DSTC9 that mostly include very limited coreferences. 36 0.0 0.2 0.4 0.6 0.8 1.0 Ground truth 0.0 0.2 0.4 0.6 0.8 1.0 Prediction score FED coherence DEAM (ours) Manipulation w/ = 0.47 DynaEval Manipulation w/ = -0.36 Mesgar Manipulation w/ = 0.1 Vakulenko Manipulation w/ = 0.13 Figure 4.5: Scatter plots and regression lines of different models predicted scores versus FED- coherence human evaluations. Overlapped points are represented darker. 4.5.4 QualitativeAnalysis We analyze the quality of DEAM versus baseline evaluators in terms of examining each model’s performance to distinguish between positive and negative examples constructed leveraging vari- ous manipulations. Some examples are shown in Table 4.6. Figure 4.7 illustrates a heat map of the accuracy scores. X-axis and Y-axis show the manipulations used for creating training and testing datasets, respectively. As is expected the highest accuracies can be found on the diagonal where models have been trained and tested on datasets generated from pursuing the same manipulation techniques. The light-colored cells are mainly related to models trained on baseline manipulated data and tested on AMR-based perturbed data. This indicates that the baseline models trained on Manipulation FED DSTC9 Coh Overall Coh Overall DEAM 0.47 0.55 0.19 0.20 -Contr 0.39 0.42 0.17 0.16 -CoRef_Inconst. 0.41 0.46 0.19 0.20 -Irrel 0.35 0.35 0.17 0.18 -Dec_Eng 0.34 0.35 0.18 0.17 Table 4.5: Correlations of DEAM with human judgments trained on different ablated manipulations. 37 0.1 0.3 0.5 0.7 Ground truth 0.0 0.2 0.4 0.6 0.8 1.0 Prediction score FED overall DEAM (ours) Manipulation w/ = 0.55 DynaEval Manipulation w/ = -0.4 Mesgar Manipulation w/ = -0.01 Vakulenko Manipulation w/ = 0.1 Figure 4.6: Scatter plots and regression lines of different models predicted scores versus FED- overall human evaluations. Overlapped points are represented darker. such types of text-level heuristic manipulations can not perform well and indeed have a random guess on more challenging incoherent examples that are generated by DEAM. 
While higher accu- racies of DEAM model on baseline test datasets show its capability to more effectively distinguish between positive and heuristically created counterpart manipulations’ negative conversations. Our proposed manipulations in DEAM are directly influenced by the quality of Text-to-AMR and AMR-to-Text generation models. Even though finetuned T5 (Raffel et al., 2020a) models used here have been shown to be effective (Ribeiro et al., 2020), there are still not perfect and suffer from some errors. We conducted a quick analysis of different deficiencies in conversations obtained by AMR-based back-translations. Most of these flaws are due to the fact that in Text-to-AMR generations some syntactic information such as verb tense, passive type of sentences, are removed from the text due to the semantic-based structure of the AMRs. Table 4.7 shows such flaws. Ongoing work on improving AMR parsers and generators will lead to more robust AMR models, thus improving the quality of the proposed manipulations as well. 4.5.5 ManipulationCoverage In the end, we conduct an analysis to explore the coverage rate of our proposed manipulations in the test datasets. To accomplish this, we analyze different commonly occurring logical flaws by advanced dialogue models via randomly selecting 25 low-quality interactions from FED (Mehri and Eskénazi, 2020) and 25 poor dialogues from DSTC9 datasets. The low-quality scores specified by human annotators indicate various types of flaws in the conversations. Our analysis suggests that we can classify those flaws into distinct categories as demonstrated in Figure 4.8. Most of the frequently happening flaws have been covered in our work except not_sensibility showing the sensibility of the generated responses. We leave not_sensibility evaluation for future works as 38 Figure 4.7: The accuracy of evaluation metrics to distinguish coherent/incoherent conversations in test data (y-axis) generated using baseline manipulations (x-axis). Figure 4.8: Statistics of different types of logical flaws observed in 50 randomly selected low-quality interactions between human and state-of-the-art dialogue systems (25 dialogues from FED and 25 from DSTC9) replicating such issues besides the AMR-based manipulations mostly requires external knowledge bases which is not the focus of this work. 4.6 Conclusion Reliable trainable coherence evaluation metrics that can efficiently measure the dynamics of interactions between interlocutors are principally influenced by the quality of the training in- stances. We showed that leveraging text-level manipulations can not adequately mirror the incoherence errors that current dialogue systems face. According to our study, DEAM can more effectively accomplish this task by relying on capabilities that AMR-based semantic perturbations and pretrained language models present. 39 IncoherentConversation Vakulenkoetal.,2018–Permuteboldedsectionwithanotherrandomdialogue ...</UTT>1886isthoughtofasthebirthyearofthemoderncar.</UTT>Yeah,Ididn’tknowthat. ItwashisBenzPatent-Motorwagen.</UTT>Thatistrue,doyouknowwhenthefirstcarwas invented?</UTT>Great question. I love that they experience eureka moments.</UTT>Yeah I never knew that, thats pretty awesome. Do you have a dog?</UTT>I do not have a dog. Do you?</UTT>... Mesgaretal.,2020–Shuffleallutterances ...</UTT>I also do not know why they would need to use the subway.</UTT>Great question. I love that they experience eureka moments.</UTT>I do not have a dog. Do you?</UTT>My friends dogs know like 50 commands! 
Even the command dance! lol</UTT>That is amazing.</UTT>I did know that. We should teach them more than just simple comands like sit and paw.</UTT>Did you know dogs have 12 different blood types?</UTT>Thats funny. Do you consider dogs to be man’s best friend.</UTT>Yeah thats weird lol it looks weird too.</UTT>Dogs drink with underside of their tongue!</UTT>... Dynaeval–Shuffleonespeaker’sutterances ...</UTT>My friends dogs know like 50 commands! Even the command dance! lol</UTT>Greatques- tion. I love that they experience eureka moments.</UTT>Yeah thats weird lol it looks weird too.</UTT>No,thatiscrazy! IwonderiftheyhavetheirownversionofO-.</UTT>Same here, That is really impressive though, but im not sure how they know which subway to take lol</UTT>Beengreat talkingtoyou.</UTT>Probably because its faster to get around on. Oh and where do they keep their subway ticket? Or do dogs ride free?</UTT>Dogsdrinkwithundersideoftheirtongue!</UTT>Yeah I never knew that, thats pretty awesome. Do you have a dog?</UTT>I am confused how dogs in moscowusethesubway.</UTT>I do, shes 2, a rescue. I think my dog is the exception when they say dogs an elephants can understand pointing lol shes obivious.</UTT>Yes,Ilovedogs.</UTT>... DEAM–CoRef_Inconst.,Contr.,Irrel. ...</UTT>My friend’s dog knows like 50 commands. Even dancing to them. LOL!</UTT>The tongue dog is drinking from the underside!</UTT>LOL, that’s weird, it looks weird too.</UTT>I’mlooking for ways you use the subway in Moscow.</UTT>Same here. That’s really impressive, but I’m not sure how they know which subway to take.</UTT>I also don’t know why they need to use the subway.</UTT>probably because it gets around faster. And where do they keep their subway tickets? or free dog rides.</UTT>Great question. I love how they experience the eureka moment.</UTT>I never knew that. That was pretty awesome. Do you have a dog?</UTT>I don’t have a dog, do you?</UTT>I did. She was rescued at age 2. I think my dog was the exception when they said, laughing out loud, "The dog and elephant can understand."</UTT>That’s funny,doyouconsiderthedogonthesubwaymy bestfriend? Idon’thaveadog,doyouoweit?</UTT>... Table 4.6: Examples of incoherent conversations resulted from different applied manipulation techniques 40 Text–>AMR–>TextExamples Original Sentence: I do. Tim Duncan did not go to the NBA until he finished college. (m / multi-sentence :snt1 (d / do-02 :ARG0 (ii / i)) :snt2 (g / go-02 :polarity :ARG0 (p / person :name (n / name :op1 "Tim" :op2 "Duncan")) :ARG4 (t / team :name (n2 / name :op1 "NBA")) :time (u / until :op1 (f / finish-01 :ARG0 p :ARG1 (c / college))))) Back-translated Sentence: I do. Tim Duncanwon’t go to the NBA until hefinishes college. Original Sentence: Nice. He was really hated in 2012 when he decided to join MIami Heats (m / multi-sentence :snt1 (n / nice-01) :snt2 (h / hate-01 :ARG1 (h2 / he) :degree (r / really) :time (d / date-entity :year 2012 :time-of (d2 / decide-01 :ARG0 h2 :ARG1 (j / join-up-02 :ARG0 h2 :ARG1 (t / team :name (n2 / name :op1 "Miami" :op2 "Heats"))))))) Back-translated Sentence: Nice. Hereallyhated in 2012 when he decided to join the Miami Heats. Original Sentence: Yes the guy is set for life, trust me. Do you like LeBron? (m / multi-sentence :snt1 (s / set-02 :ARG1 (g / guy) :ARG2 (l / life) :ARG1-of (t / trust-01 :mode imperative :ARG0 (y / you) :ARG2 (ii / i))) :snt2 (l2 / like-01 :ARG0 y :ARG1 (p / person :name (n / name :op1 "LeBron")) Back-translated Sentence: Trust me,guysare set in life,likeLeBron? 
Original Sentence: Wow is he not a huge part of the show though? (h / have-concession-91 :ARG1 (h2 / have-part-91 :polarity - :ARG1 (s / show-04) :ARG2 (h3 / he) :mod (h4 / huge) :mod (w / wow :mode expressive))) Back-translated Sentence: Wow,buthe’snot a huge part of the show. Original Sentence: They use Blue Tube to share law enforcement videos. ((u / use-01 :ARG0 (t / they):ARG1 (p / publication :name (n / name :op1 "Blue" :op2 "Tube")):ARG2 (s / share-01 :ARG0 t :ARG1 (v / video :topic (e / enforce-01 :ARG1 (l / law))))) Back-translated Sentence: Theyused the Blue Tube to shareavideo about law enforcement. Original Sentence: you have to tame them they emulate the owner. (o / obligate-01 :ARG1 (y / you) :ARG2 (t / tame-01 :ARG0 y :ARG1 (t2 / they :ARG0-of (e / emulate-01 :ARG1 (p / person :ARG0-of (o2 / own-01)))))) Back-translated Sentence: You have to tame thembyemulatingtheowner. Table 4.7: Examples of original sentences, their linearized AMR graphs and back-translated sentences indicated with green, black and blue colors respectively. Bold parts of the sentences demonstrate the syntactical changes resulted from AMRs that usually do not cover such informa- tion. 41 Chapter5 CommonsenseEvaluationofOpen-DomainDialogue Systems The abundance of dialogue corpus (Dziri et al., 2018) and the development of neural models (Rad- ford et al., 2019; Lewis et al., 2020) enable open-domain dialogue systems to generate grammatically correct and meaningful responses (Zhang et al., 2020c; Bao et al., 2021; Ghazarian et al., 2021). De- spite the success, systems still struggle to consistently produce commonsense-compliant responses as humans do. As shown in Figure 5.1 Example A, the generated response is not compliant with commonsense since “need an oxygen mask” is not a reasonable prerequisite for “like to paint”. Commonsense issues for dialogue systems can also be manifested when we consider the dialogue history. For instance, in Figure 5.1 Example B, the system’s response “Thatisinteresting!” after the user talks about their car accident violates commonly accepted social norms (Frischmann, 2021). As it is clear, commonsense reasoning is omnipresent in human communications and thus is an important feature for open-domain dialogue systems. However, evaluating commonsense in dialogue systems is still an open challenge. In this chapter, we will take the first steps in studying commonsense evaluation of open-domain dialogue systems by focusing on event commonsense that considers events and their relations and is crucial in both dialogues and general commonsense reasoning. We will propose a trainable event commonsense evaluator that extracts event-relation tuples from the utterance and assigns scores based on their compatibility with the commonsense knowledge bases. 5.1 EventCommonsense We investigate event commonsense (Sap et al., 2020; Rashkin et al., 2018) evaluation as the most important component of the overall commonsense evaluation that takes into consideration events and their relations. Our focus on event commonsense is motivated by the following three observations: First, advanced open-domain dialogue systems have been pre-trained on large corpus and thus suffer less from factoid commonsense issues (Petroni et al., 2019). Second, events and their relations are key components of commonsense reasoning (McCarthy and Hayes, 1981), and our study shows overall commonsense and event commonsense are highly correlated. 
Third, event 42 PersonX be allergic to paint PersonX creates art PersonX likes to paint HinderedBy to learn to paint xEffect to get a paint brush xNeed CSKB PersonX’s family mourns sad PersonX loses arms from a car accident oEffect to help PersonX oWant scared oReact CSKB Human: I boated around the world with my husband when we were younger. PLATO-2: I love boating. I also like to paint, I just need an oxygen mask. Human: I actually lost my arm from a car accident so I do not have to wear uniform at work. DiSCoL: That is interesting! Do you know Apple has anything to say on that? Example A Example B Figure 5.1: Examples of nonsensical system responses in open-domain dialogues. commonsense aligns well with the interactive nature of open-domain dialogue systems (Huang et al., 2020b) to complete certain social goals. 5.2 Method We present ACCENT (Ghazarian et al., 2023), as a framework for event commonsense evaluation. Figure 5.2 gives an overview of ACCENT with two major components. 5.2.1 SymbolicIntermediateRepresentation ACCENT uses event-relation tuples as the symbolic intermediate representation. Each tuple contains a head event and a tail event which are connected through an event relation. We formally define events and relations below. Event Following Pustejovsky et al. (2003), we define events as short phrases with a trigger word and its arguments (e.g., I like to paint). To better align with ATOMIC 20 20 , we normalize the event by replacing tokens referring to people withPerson variable (e.g., PersonX likes to paint). Relation We select ˜ R ={xIntent, xWant, oWant, xReact, oReact, xNeed, xAttr, xEffect , oEffect , HinderedBy, IsAfter, HasSubEvent} 1 from ATOMIC 20 20 relations. These relations cover human behaviors, i.e., motivation, want, reaction, need, description, towards events (Sap et al., 2019), the 1 “x” and “o” pertain to PersonX and other person(s). 43 Event-Relation Extractor Dynamic CSKB (COMET) Dialogue History Target Response 𝒕𝒕 𝒏𝒏 𝒏𝒏 𝒏𝒏 𝒏𝒏 𝟏𝟏 … 𝒕𝒕 𝒏𝒏 𝒏𝒏 𝒏𝒏 𝒏𝒏 𝒌𝒌 𝒇𝒇 ( 𝒉𝒉 𝒏𝒏 , 𝒓𝒓 𝒏𝒏 , 𝒕𝒕 𝒏𝒏 | 𝑪𝑪) ( 𝒉𝒉 𝒏𝒏 , 𝒓𝒓 𝒏𝒏 , 𝒕𝒕 𝒏𝒏 ) Query Max Similarity 𝒕𝒕 𝟏𝟏 𝒏𝒏 𝒏𝒏𝒏𝒏 𝟏𝟏 … 𝒕𝒕 𝟏𝟏 𝒏𝒏 𝒏𝒏𝒏𝒏 𝒌𝒌 𝒇𝒇 ( 𝒉𝒉 𝟏𝟏 , 𝒓𝒓 𝟏𝟏 , 𝒕𝒕 𝟏𝟏 | 𝑪𝑪) ( 𝒉𝒉 𝟏𝟏 , 𝒓𝒓 𝟏𝟏 , 𝒕𝒕 𝟏𝟏 ) Average Event Commonsense Score Step 1: Event-Relation Extraction Step 2: Compatibility Test Figure 5.2: The overview of ACCENT. Given the target response and its dialogue history, AC- CENT first extracts the event-relation tuples. Then, the compatibility test (detailed illustration in Figure 5.4) assigns a score to each tuple: ACCENT queries the dynamic CSKB, i.e., COMET, withℎ and𝑟, and generates𝑘 events. The compatible score is the maximum similarity between the ground-truth𝑡 and the𝑘 generated events{𝑡 𝑖 𝑔𝑒𝑛 } 𝑘 𝑖=1 . Scores for all tuples in a response are averaged to obtain the event commonsense score for the target response. cause-effect and constraint in force dynamic (Talmy, 1988), the temporal information, as well as the parent-child relation in event hierarchy. 5.2.2 Event-RelationExtraction The input of the event commonsense evaluation task is a list of utterances{𝑢 0 ,𝑢 1 ,···,𝑢 𝑛−1 } representing the dialogue history and the target response𝑢 𝑛 . ACCENT first converts the free- form text into event-relation tuples. To retain the information in𝑢 𝑛 , ACCENT extracts tuples whose head and tail events are both from the target response (denoted as “Single”). 
Besides, to capture event commonsense issues conditioned on the dialogue history (e.g., Figure 5.1 Example B), ACCENT also extracts tuples whose two events come from𝑢 𝑛 and𝑢 𝑛−1 respectively (denoted as “Pair”). As illustrated in Figure 5.3, the event-relation extractor in ACCENT is a T5 modelM (Raffel et al., 2020b) guided to generate the head and tail events via designed prompts for each relation. ACCENT concatenates the prompt for𝑟∈ ˜ R and the dialogue as the input and fine-tunes M in a low resource setting. When the relation𝑟 exists in the input utterances, the fine-tuned M is expected to generate the head and tail events following a particular format,i.e., “event1: {head}; event2: {tail}”, so that the tuple can be parsed from the decoded sequence (from Block A to Block B in Figure 5.3). Otherwise, the fine-tuned M is expected to output “None”. For each𝑟∈ ˜ R, the designed prompt explains the semantic meaning of𝑟 and triggers the model to generate the head and tail events (the prompts are included in Table 5.1). At the inference time, we queryM with prompts for each𝑟 and parse the generated outputs to getℎ and𝑡 to construct tuples. 44 Relation SemanticMeaning DesignedPrompt (Extract event1 and event2 from the text where ...) xIntent because PersonX wanted event2 shows PersonX’s intent for event1. xNeed but before, PersonX needed event2 needs to be true for event1 to take place. xReact as a result, PersonX feels event2 shows how PersonX reacts to event1. oReact as a result, Y or others feels event2 shows how PersonY reacts to event1. xWant as a result, PersonX wants event2 shows what PersonX wants after event1 happens. oWant as a result, Y or others wants event2 shows what PersonY wants after event1 happens. xAttr X is seen as event2 shows how PersonX is viewed as after event1. xEffect as a result, PersonX will event2 shows the effect of event1 on PersonX. oEffect as a result, Y or others will event2 shows the effect of event1 on PersonY. HinderedBy can be hindered by event1 fails to happen because event2. IsAfter happens after event1 happens after event2. HasSubEvent includes the event/action event1 includes event2. Table 5.1: Semantic meanings and designed prompts for the selected ATOMIC 20 20 relations. The semantic meanings are from Hwang et al. (2021). 5.2.3 CompatibilityTest After extracting event-relation tuples, ACCENT checks whether these tuples are sensible through a compatibility test. Denoting the CSKB asC, the compatibility test aims to learn a scoring function 𝑓 based onC, where𝑓((ℎ,𝑟,𝑡)|C) represents the compatibility of the target tuple(ℎ,𝑟,𝑡) with the CSKBC. we propose to score(ℎ,𝑟,𝑡) by querying a dynamic version ofC withℎ and𝑟. Figure 5.4 gives an example of the whole process. Specifically, ACCENT uses COMET (Bosselut et al., 2019) as the dynamic CSKB. COMET adapts the pre-trained language model by fine-tuning it on C through a conditional generation task where “{head} {relation} [GEN]” is the source and a tail event is the target. To score(ℎ,𝑟,𝑡), we query the model by requiring it to generate𝑡 𝑔𝑒𝑛 given “{ℎ} {𝑟} [GEN]”. The beam search method is applied for decoding, so we obtain a set of generated tail events,{𝑡 𝑖 𝑔𝑒𝑛 } 𝑘 𝑖=1 , where𝑘 is the beam size. The compatibility score for(ℎ,𝑟,𝑡) is then computed by checking the similarity between𝑡 and the most similar𝑡 𝑔𝑒𝑛 among{𝑡 𝑖 𝑔𝑒𝑛 } 𝑘 𝑖=1 : 𝑓((ℎ,𝑟,𝑡)|C)= max 1≤𝑖≤𝑘 cos(embed(𝑡),embed(𝑡 𝑖 𝑔𝑒𝑛 )) (5.1) Here,embed(·) is parameterized by a Sentence-BERT model (Reimers and Gurevych, 2019). 
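To make the scoring in Equation (5.1) concrete, the following is a minimal sketch of the compatibility test, assuming a seq2seq COMET checkpoint and the paraphrase-MiniLM-L6-v2 Sentence-BERT encoder mentioned in the implementation details; the checkpoint path and the compatibility_score helper are illustrative, not the released ACCENT code.

```python
# Minimal sketch of the compatibility test (Equation 5.1), assuming a seq2seq
# COMET checkpoint and a Sentence-BERT encoder; names/paths are illustrative.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

comet_tok = AutoTokenizer.from_pretrained("path/to/comet-atomic-2020")   # assumed checkpoint path
comet = AutoModelForSeq2SeqLM.from_pretrained("path/to/comet-atomic-2020")
embedder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def compatibility_score(head: str, relation: str, tail: str, k: int = 10) -> float:
    """Score (h, r, t) by generating k tail events from COMET and taking the
    maximum cosine similarity between t and the generated tails."""
    query = f"{head} {relation} [GEN]"
    inputs = comet_tok(query, return_tensors="pt")
    outputs = comet.generate(
        **inputs, num_beams=k, num_return_sequences=k, max_new_tokens=24
    )
    generated_tails = comet_tok.batch_decode(outputs, skip_special_tokens=True)

    tail_emb = embedder.encode(tail, convert_to_tensor=True)
    gen_embs = embedder.encode(generated_tails, convert_to_tensor=True)
    return util.cos_sim(tail_emb, gen_embs).max().item()

# Example: the tuple illustrated in Figure 5.4
score = compatibility_score(
    "PersonX not hears back from the other companies", "xReact", "PersonX feels proud"
)
```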
After getting the compatibility scores for each extracted tuple, we average them to get the final score for the target response (see Figure 5.2). We construct the first event commonsense evaluation dataset for open-domain dialogues through crowdsourcing on Amazon Mechanical Turk (MTurk). In this section, we describe the collection procedure and the details of the dataset. 45 xReact Extract event1 and event2 from the text where event2 shows how PersonX reacts to event1. xEffect Extract event1 and event2 from the text where event2 shows the effect of event1 on PersonX. Designed Prompts T5 Open-domain Dialogue Dialogue History: I want to know whether you will come to the interview. So have you accepted offers from other companies? Target Response (PLATO-2): I have not heard back from the other companies. I am proud of it, but do you have any other offers? Prompt for Each Relation: Decode (PersonX not hears back from the other companies, xReact, PersonX feels proud) (PersonX not hears back from the other companies, xEffect, PersonX asks for other offers) event1: PersonX not hears back from the other companies; event2: PersonX asks for other offers event1: PersonX not hears back from the other companies; event2: PersonX feels proud Parse Extracted Event-Relation Tuples xReact xEffect (A) (B) Figure 5.3: Illustration for event-relation extraction. For each relation𝑟∈ ˜ R, we use its correspond- ing prompt to guide the model to generateℎ and𝑡. The final tuple is parsed from the generated output. 5.2.4 DialogueDataPreparation We select dialogue histories from DailyDialog (Li et al., 2017c), PersonaChat (Zhang et al., 2018c), and TopicalChat (Gopalakrishnan et al., 2019b) human-human dialogues. The dialogue history is limited to at most 4 consecutive utterances. Since human utterances barely contradict event com- monsense, to better evaluate machine generated dialogues, we collect responses using advanced dialogue systems, DialoGPT (Zhang et al., 2020c), PLATO-2 (Bao et al., 2021), DiSCoL (Ghazarian et al., 2021). To ensure most samples contain events and are meaningful for event commonsense evaluation, we filter samples using the following criteria: (1) the response contains at least 5 words; (2) the response contains at least 1 non-interrogative sentence 2 ; (3) the response is more than a courtesy (e.g., “It’s been nice chatting with you.”) 3 . After filtering, we randomly select 300 samples and 2 We check this by finding sentences that are not ended with a question mark (“?”). 3 These responses are manually filtered out. 46 sim=0.33 sim=0.21 Dynamic CSKB (COMET) (PersonX not hears back from the other companies, xReact, PersonX feels proud) 𝒕𝒕 𝒈𝒈 𝒈𝒈𝒈𝒈 𝟏𝟏 : frustrated 𝒕𝒕 𝒈𝒈 𝒈𝒈𝒈𝒈 𝟐𝟐 : disappointed 𝒕𝒕 𝒈𝒈 𝒈𝒈𝒈𝒈 𝒌𝒌 : unhappy sim=0.25 𝒇𝒇 𝒉𝒉 , 𝒓𝒓 , 𝒕𝒕 𝑪𝑪 = 𝟎𝟎 . 𝟑𝟑𝟑𝟑 Query with 𝒉𝒉 and 𝒓𝒓 Max Similarity 𝒕𝒕 Figure 5.4: An example of compatibility test. We query the dynamic CSKB withℎ and𝑟, and the compatibility score is the maximum similarity between𝑡 and the generated tail events ({𝑡 𝑖 𝑔𝑒𝑛 } 𝑘 𝑖=1 ). oEffect 6.8% oReact 10.7% oWant 4.3% xEffect 15.2% xReact 4.5% xIntent 7.9% xWant 6.2% HinderedBy 4.3% IsAfter 13.1% HasSubEvent 9.6% xNeed 7.5% xAttr 9.9% Cause-Effect Reaction Temporal Want Parent-Child Description MotivationConstraint Need Figure 5.5: Relation distribution in DECO test set. split them into 200 for training and 100 for testing. We name this datasetDECO (DialogueEvent Commonsense Dataset). 
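As a rough illustration of the two automatic filtering criteria above (the courtesy filter is applied manually), a simple filter could look like the sketch below; the function name and the naive sentence splitting are assumptions for illustration.

```python
import re

def passes_automatic_filters(response: str) -> bool:
    """Keep responses with at least 5 words and at least one
    non-interrogative sentence (i.e., a sentence not ending with '?')."""
    if len(response.split()) < 5:
        return False
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s.strip()]
    return any(not s.endswith("?") for s in sentences)

# Example usage
print(passes_automatic_filters("Do you like dogs?"))             # False
print(passes_automatic_filters("I have a rescue dog. Do you?"))  # True
```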
5.2.5 TupleExtraction To train the event-relation extractor of ACCENT, we collect human extracted event-relation tuples from DECO training set. Annotators are shown with the target response, the dialogue history, a specific relation, and are requested to compose event-relation tuples. They are allowed to tick “I cannot find any tuple” if no tuple can be found. We also request them to select whether the tuple belongs to “Single” or “Pair” (defined in section 5.2.2) for each tuple they extract. We launched HITs 4 for relations in ˜ R repeatedly until we obtained at least 20 tuples for each relation. In order to ensure the test set is comprehensive, we particularly request annotators to compose tuples for all 12 relations in ˜ R (100 samples× 12 relations in total). A separate validation round was conducted to check whether each extracted tuple satisfies (1) the head and tail are events, (2) the head and tail come from𝑢 𝑛 or𝑢 𝑛−1 , (3) the relation between the head and tail can be inferred from the dialogue. A tuple is deemed valid if the majority of 3 annotators vote “yes”. After removing invalid tuples (the dialogue numbers remain unchanged), we collected 307 tuples for training and 467 tuples from the DECO test set. Figure 5.5 shows the relation distribution in the densely annotated test set. 4 HIT is an assignment unit on Amazon MTurk. 47 5.2.6 CommonsenseScoring We instruct annotators to score target responses in terms of event commonsense by focusing on the events and their relations. Each response was annotated by 3 individual annotators with a scale of 1 to 5. Following Mehri and Eskenazi (2020), we measure the inter annotator agreement (IAA) by correlating each annotation with the mean of the other annotations for the same sample, and the Spearman correlation is 0.578 showing an acceptable agreement. 5 The final event commonsense score assigned to each sample is the average of 3 individual ratings. We also requested the annotators to judge the overall commonsense of a dialogue response before introducing event commonsense to annotators. Among the 900 annotation pairs we col- lected, the Spearman correlation between the two scores reaches 0.862, which indicates that event commonsense is a key component in overall commonsense reasoning. 5.2.7 AdditionalHuman-MachineDialogues We further explore the generalization ability of ACCENT on responses with human-machine dialogue histories. We select 100 samples from ConTurE (Ghazarian et al., 2022a), a turn-level evaluation dataset, to annotate event commonsense scores. We denote this dataset asConTurE Subset. 5.2.8 Setups We compare ACCENT with baseline methods for event commonsense evaluation and also examine its two components separately. Therefore, our experiments include three setups for the evaluation: Setup1(MetricsPerformance) Our main goal is to evaluate the commonsense metric, and we achieve this by computing the correlation between automatic scores and human judgments. ACCENT and baseline metrics are tested on DECO test set and ConTurE Subset. Setup 2 (Event-Relation Extraction) We evaluate the performance of the event-relation extraction component of ACCENT by comparing the automatically extracted tuples with human extracted tuples on DECO test set. We view checking whether a tuple with relation𝑟 is extracted from the utterances𝑢 𝑛 and𝑢 𝑛−1 as a binary classification problem and compute the F1 score. We also measure how “close” the automatically extracted head and tail events are to human extraction results. 
We convert the tuple into a sentence by concatenating the head and tail, and then compute BLEU-2 (Papineni et al., 2002b) and BERTScore (Zhang et al., 2020b). Setup3(CompatibilityTest): The compatibility test component of ACCENT can be viewed as a tuple scoring task. We compare our proposed approach with other tuple scoring methods on a large-scale benchmark (Fang et al., 2021a) which contains event-relation tuples with 0 (compatible to a given CSKB) or 1 (not compatible to the CSKB) scores. Since the training relations in this benchmark differ from relations supported by the off-the-shelf COMET, we train our own COMET on its training set to make our compatibility test component applicable to this test set. This benchmark dataset covers all 12 relations in ˜ R as well as 6 more relations. 5 0.40-0.69 implies strong relationship. 48 5.2.9 Baselines We compare ACCENT with 5 baseline metrics: (1, 2)FED-understandable/appropriate (Mehri and Eskenazi, 2020) are two off-the-shelf baselines. “Understandable” and “Semantically Appropri- ate” are closer to commonsense compared to the rest of the criteria in FED. (3)Cross-encoder is a widely used model for sentence-pair regression tasks. We use BART (Lewis et al., 2020) as the backbone. (4)Cross-encoder(COMET) is a variant of (3) with COMET trained on ATOMIC 20 20 as the backbone. (5) MLP regressor (Zhou et al., 2021b) is trained with neural features from DialoGPT and symbolic features from ConceptNet. The cross-encoders and the MLP regressor require event commonsense scores to train the model in an end-to-end manner. We use the annotated scores in DECO training set to train them, and split 20% data for validation to conduct hyperparameter search. For Setup 2, we consider the following baseline approaches: (1) ASER Extractor (Zhang et al., 2020a) first extracts events through patterns from dependency parsing and then uses a neural classifier to predict the relation. (2) CSKBSearch (Zhou et al., 2021b) searches the one-hop neighbors in ATOMIC 20 20 through keyword matching. For Setup 3, we consider 4 tuple scoring baselines. These baselines convert a tuple to an embedding and train a binary classifier to give score: (1) Bert feedsℎ,𝑟,𝑡 toBert and concatenates their [CLS] embeddings to get the tuple embedding. (2)BertSAGE (Fang et al., 2021b) further concatenates the average embedding of the neighbors ofℎ and𝑡 in an event knowledge graph. (3) KG-Bert (Yao et al., 2019) inputs “[CLS],ℎ, [SEP],𝑟, [SEP],𝑡” to get the tuple embedding. (4) KG-BertSAGE (Fang et al., 2021a) further concatenates the average embedding of neighboring nodes. We use RoBERTa LARGE (Liu et al., 2020) as the backbone which has roughly the same parameter budget with COMET to have a fair comparison. 5.2.10 ACCENTImplementation The proposed ACCENT framework is implemented using the Transformers library (Wolf et al., 2020). For event-relation extraction, we fine-tune T5-base 6 for 50 epochs with the batch size of 4 and the learning rate of 5e-5. The training data comes from the human extracted tuples from DECO training set. We additionally select 5 negative samples (dialogues that do not have a certain relation) per relation from the training set and set their target output as “None” to guide the model to handle cases which do not contain a certain relation. During inference, if no tuple is extracted after considering all relations, we assign a score of 0.5 to the sample. For compatibility test, we use the off-the-shelf COMET model trained on ATOMIC 20 20 (Hwang et al., 2021) 7 . 
When querying COMET through generation, we use beam search with a beam size of 10 to get commonly sensible tail events. The embed(·) in Equation (5.1) is parameterized by paraphrase-MiniLM-L6-v2 provided in the Sentence-Transformers library 8 . 6 huggingface.co/t5-base 7 github.com/allenai/comet-atomic-2020 8 huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2 49 DECO ConTurE 𝛾 𝜌 𝛾 𝜌 FED-appropriate -0.16 -0.10 -0.09 -0.04 FED-understandable -0.12 -0.07 -0.08 -0.04 Cross-encoder 0.15 0.15 -0.05 -0.09 Cross-encoder (COMET) 0.17 0.17 0.00 0.00 MLP Regressor 0.11 0.01 0.17 0.16 ACCENT (Ours) 0.30 0.30 0.21 0.22 Table 5.2: Pearson (𝛾) and Spearman (𝜌) correlations between human judgments and different automatic evaluation metrics. The results for ACCENT are all significant ( 𝑝 < 0.05). DialogueHistory TargetResponse Human ACCENT ExtractedTuple Yes, I run them all the time! How exciting, are you ready? I’m ready for the marathon! So I’m going to be running a half marathon in a few weeks. 0.80 0.61 (PersonX is ready for the marathon, xEffect, PersonX run a half marathon) I like pizza as long as it’s a meat lover. I hate vegetables. Lol, yes it is! Great for you to eat so healthy! 0.53 0.45 (PersonX hate vegetables, xAttr, PersonX be healthy) I had an accident. That’s what I like to call the most beautiful thing. 0.27 0.23 (PersonX has an accident, oWant, PersonY wants to call the most beautiful thing) Table 5.3: Examples of evaluation scores given by ACCENT from the DECO test set. Only𝑢 𝑛−1 in the dialogue history is shown, and the human annotations are normalized to the range of(0,1) for better demonstration. “Extracted Tuple” column shows tuples extracted in ACCENT framework which account for the given scores. 5.2.11 MetricsPerformance Table 5.2 shows the correlations between automatic scores and human annotations. ACCENT uniformly outperforms the baselines on both two test sets. Specifically, off-the-shelf metrics (“FED-appropriate”, “FED-understandable”) perform poorly. For “Cross-encoder (COMET)”, its results show that implicitly using the CSKB through transfer learning cannot yield satisfactory performance. Besides, cross-encoders fail to generalize well to ConTurE Subset whose dialogue histories are from human-machine dialogues. For “MLP Regressor”, although it tries to utilize the CSKB explicitly, it is not as effective as ACCENT. Some examples from the DECO test set and their event commonsense scores given by ACCENT are shown in Table 5.3. These scores are close to human judgements and enjoy great interpretability owning to the extracted event-relation tuples. Apart from the sample-level correlation, we further examine whether ACCENT can reflect model performance in terms of event commonsense. Figure 5.6 shows the rankings of three dialogue systems used in DECO construction given by human and ACCENT. Human and ACCENT rank the three systems exactly the same and the two sets of averaged scores highly correlates with each other. 50 0.35 0.37 0.39 0.41 0.43 0.45 2.5 2.7 2.9 3.1 3.3 3.5 Averaged ACCENT Score Averaged Human Score DisCoL DialoGPT PLATO-2 Figure 5.6: Average event commonsense scores of generated responses of different models using human annotations (scale 1 to 5) and ACCENT automatic evaluation (scale 0 to 1). The rankings of systems given by human and ACCENT are the same. 
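For reference, the sample-level Pearson and Spearman correlations reported in Table 5.2 can be computed with standard statistical tools; a minimal sketch with placeholder score lists follows.

```python
# Sketch of the sample-level correlation computation behind Table 5.2;
# `human_scores` and `metric_scores` are placeholder lists of per-response values.
from scipy.stats import pearsonr, spearmanr

human_scores = [3.7, 2.0, 4.3, 1.7, 5.0]        # placeholder human judgments
metric_scores = [0.61, 0.45, 0.70, 0.23, 0.78]  # placeholder ACCENT scores

gamma, p_gamma = pearsonr(human_scores, metric_scores)
rho, p_rho = spearmanr(human_scores, metric_scores)
print(f"Pearson={gamma:.2f} (p={p_gamma:.3f}), Spearman={rho:.2f} (p={p_rho:.3f})")
```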
                 P     R     F1    BLEU   BERTScore
CSKB Search      29.9  96.3  45.7  26.9   89.8
ASER Extractor   31.5  23.6  27.0  32.4   89.3
Ours             31.4  55.0  40.0  41.6   93.5
Table 5.4: Performances of different event-relation extraction methods on DECO test set. P: Precision. R: Recall.
5.2.12 Tuple Extraction Performance
Table 5.4 shows the results of Setup 2, where we evaluate event-relation extraction performance on the DECO test set. Our proposed method achieves much higher BLEU and BERTScore than the two baselines, indicating that the composed events in the extracted tuples have reasonable quality. However, joint event-relation extraction remains challenging because it combines event extraction with relation identification. Although our proposed method outperforms ASER Extractor in F1, it still has plenty of room for improvement. CSKB Search, on the other hand, reaches a higher F1 mainly because it returns a large number of tuples, which results in very high recall but very poor precision. Moreover, searching the CSKB is not applicable in our framework because this method can only return sensible tuples.
5.2.13 Compatibility Test Performance
Table 5.5 depicts the test results on the benchmark dataset. Our method outperforms all baselines, and it does not require negative samples for training. The major difference between our method and those tuple scoring baselines is that we use the tuples in the existing CSKB to train a dynamic CSKB, i.e., COMET, instead of a discriminative model. We assume our strong results may be due to the generalization ability of the dynamic CSKB.
               Subset      All
Bert           62.0±0.3    61.5±0.3
BertSAGE       55.8±0.7    55.8±0.7
KG-Bert        62.6±0.7    62.3±0.8
KG-BertSAGE    63.2±0.4    62.9±0.3
Ours           68.0±0.8    67.6±0.8
Table 5.5: Test results on the CSKB compatibility benchmark. We report the overall AUC across all relations ("All") and the AUC across samples with our target relations ("Subset"). Both the averaged metric and its standard deviation are reported over 3 runs.
5.2.14 Ablation Studies
We conduct ablation studies to explore (1) whether the proposed event-relation extraction method leads to better final metric performance, and (2) given the automatically extracted tuples, whether the proposed compatibility test method leads to higher correlation with human judgment. To answer (1), we compare different methods of obtaining the event-relation tuples (Part I in Table 5.6). Among the event-relation extraction baselines, we only consider ASER Extractor because CSKB search is not applicable in our framework, as discussed in Section 5.2.12. Note that the event-relation extractor in ACCENT considers tuples in both the "Single" and "Pair" settings to cover the two potential types of errors (see Section 5.2.2). To verify this choice, we compare variations of our proposed method in which only tuples marked as "Single" or "Pair" are used for model training. Also, the human-extracted tuples in the DECO test set are used to provide an upper bound. To answer (2), we fix the event-relation extraction part and change the compatibility test part (Part II in Table 5.6). We consider Bert and KG-Bert trained on the CSKB compatibility benchmark because they do not need event graph information and can be seamlessly applied to our compatibility test. Also, while we query COMET through tail generation, another intuitive design is to use the model loss with "{h} {r} [GEN]" as the source and t as the target to give scores. We map the loss to (0,1) through an exponential function and name this alternative "COMET (neural)", as it skips the symbolic decoding of t_gen. Table 5.6 demonstrates that the whole ACCENT gives the best result.
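For clarity, the sketch below illustrates the "COMET (neural)" alternative just described; the exp(−loss) mapping into (0, 1) and the checkpoint path are assumptions, since the text only states that an exponential function is used.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "COMET (neural)" ablation sketch: score (h, r, t) by the exponentiated,
# negated loss of generating t from "{h} {r} [GEN]". The exp(-loss) mapping
# and the checkpoint path are illustrative assumptions.
tok = AutoTokenizer.from_pretrained("path/to/comet-atomic-2020")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/comet-atomic-2020")

def comet_neural_score(head: str, relation: str, tail: str) -> float:
    source = tok(f"{head} {relation} [GEN]", return_tensors="pt")
    labels = tok(tail, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**source, labels=labels).loss  # mean token-level cross-entropy
    return torch.exp(-loss).item()  # maps loss in [0, inf) into (0, 1]
```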
Considering the vari- ations of our design, “w/o Pair” gives much lower results, indicating that limiting the symbolic intermediate representation to only the information contained in the target response is not enough. This observation is in accord with our finding that some event commonsense errors occur when we take the dialogue history into account. 52 DECO ConTurE ACCENT(whole) 0.30 0.22 I ASER Extractor 0.14 0.00 w/o Pair 0.19 0.08 w/o Single 0.24 0.18 Gold Tuples 0.42 - II Bert -0.08 0.09 KG-Bert 0.13 0.19 COMET (neural) 0.16 0.05 Table 5.6: Ablation results measured by Spearman correlation. I: Ablation of the event-relation extraction part. The gray row shows the results using human extracted tuples which provides an upper bound. II: Ablation of the compatibility test part of ACCENT. STSAvg. DECO ConTurE Sentence-BERT 79.82 0.30 0.22 DiffCSE 9 78.21 0.12 0.25 ESimCSE 10 77.44 0.19 0.24 Sup-SimCSE 11 82.52 0.31 0.26 Table 5.7: Results with different sentence embedding methods measured by Spearman correlation. Following Gao et al. (2021), we use the average results on the semantic textual similarity (STS) tasks to reflect the sentence embedding performance. 53 Pearson Spearman COMET (neural) 0.14 0.25 ACCENT approach 0.40 0.42 Table 5.8: Correlations between human judgments and different compatibility test approaches with human-extracted tuples on DECO test set. Dialogue Tuple Human COMET (nueral) ACCENT approach A: I work in the bakery and eat all my favorite cupcakes. What do you do? B:IactuallyjustgotfiredforamistakeImade. (PersonX makes a mistake, xEffect, PersonX gets fired) / 0.33 0.63 (PersonX gets fired, isAfter, PersonX makes mistake) / 0.12 0.68 (PersonX gets fired, HasSubEvent, PersonX makes mistake) / 0.18 0.66 Average 0.80 0.21 0.66 A: Yeah winter is coming soon. It gonna be really cold. B:IknowIknow. Iwanttoliveinacoldplace beforeIgofullonwinter. (PersonX wants to live in a cold place, xIntent, PersonX intends to go full on winter) / 0.06 0.59 (PersonX goes full on winter, xNeed, PersonX lives in cold place) / 0.53 0.39 (PersonX goes full on winter, isAfter, PersonX lives in a cold place) / 0.95 0.55 (PersonX knows winter is coming, HasSubEvent, PersonX wants to live in a cold place) / 1.63 0.69 Average 0.40 0.79 0.56 Table 5.9: Examples of results given by different compatibility test approaches. Only 𝑢 𝑛−1 in the dialogue history is shown. Human annotations are normalized to the range of (0, 1) and scores given by “COMET (neural)” are scaled up by 100 times for better demonstration. Compatibility scores for each tuple are averaged to get the final score for the target response (in bold font). Another empirical discovery is that although “COMET (neural)” is a direct way of using the dynamic CSKB, its performance is poorer than what we propose in ACCENT. We assume that comparing𝑡 and𝑡 𝑔𝑒𝑛 in a symbolic fashion can yield more comparable scores among tuples with different relations. ACCENT checks whether an event-relation tuple(ℎ,𝑟,𝑡) is compatible with the commonsense knowledge by comparing the similarity between 𝑡 and commonsense tails generated by the Dynamic CSKB (COMET). Ablation results in Table 5.6 show that the compatibility test approach in ACCENT yields better performance than the “COMET (neural)” alternative which also uses the COMET model. To exclude the potential noise introduced by the automatically extracted tuples, we further compare these two methods using human-extracted tuples on DECO test set. 
Results in Table 5.8 demonstrate that the conclusion still holds under this experimental setting. Table 5.9 gives two samples with a breakdown of tuple results. Compared with the compatibility scores given by ACCENT approach, the scores given by “COMET (neural)” are less comparable among tuples with different relations, thus making this method unsuitable for ACCENT framework. 9 https://huggingface.co/voidism/diffcse-roberta-base-sts 10 https://huggingface.co/ffgcc/esimcse-roberta-base 11 https://huggingface.co/princeton-nlp/sup-simcse-roberta-base 54 In our implementation of ACCENT, the comparison of𝑡 and𝑡 𝑔𝑒𝑛 is done by calculating the cosine similarity between their Sentence-BERT embeddings. We further experiment with other sen- tence embedding methods based on contrastive learning. Specifically, we consider DiffCSE (Chuang et al., 2022), ESimCSE (Wu et al., 2022) which are two unsupervised contrastive learning frame- works for learning sentence embeddings. We also consider Sup-SimCSE (Gao et al., 2021) which leverages annotated natural language inference datasets by using “entailment” pairs as positives and “contradiction” pairs as hard negatives in the contrastive learning objective. As shown in Ta- ble 5.7, ACCENT can benefit from the improvement of the sentence embedding method, i.e., using Sup-SimCSE (Gao et al., 2021). We support both Sentence-BERT and Sup-SimCSE in our released ACCENT codebase. 5.2.15 ErrorAnalysis Since ACCENT is a pipeline framework, there is likely error propagation. In section 5.2.14, we rule out the errors introduced by the event-relation extraction component by using human- extracted gold tuples. Results show that ACCENT with gold tuples (see “Gold Tuples” in Table 5.6) gives higher correlation with human judgment than “ACCENT (whole)” which uses the model- extracted tuples, indicating that ACCENT can benefit from high quality symbolic intermediate representation. We conduct a qualitative analysis of the event-relation component in ACCENT. Table 5.10 shows some examples of the extracted tuples. While most of the head and tail events are nicely composed and capture the major information in the given text, they are not perfect. Multiple participants involved in the dialogue further increase the difficulty of the task. We note that the model sometimes confuses the multiple participants in the dialogue and makes mistakes when using “PersonX” or “PersonY”. For example, in the third sample of Table 5.10, the model confuses different participants since the subjects of “raise money” and “go tomorrow for treatment” should be different. Such confusion will lead to the wrong tuples which cannot truly reflect the meaning of the dialogue. Also, identifying relation from the given dialogue is challenging. Although we include negative samples (dialogues that do not have a certain relation) when fine-tuning T5, errors still exist (e.g., the tuple with “xAttr” relation in the third sample of Table 5.10). 5.3 Conclusion We presented ACCENT, an automatic evaluation metric for event commonsense evaluation of open-domain dialogue systems. We showed that by learning event-relation tuples as the symbolic intermediate representations in a low-resource prompt-based setting, ACCENT can effectively utilize the CSKB. ACCENT estimates the compatibility of the extracted tuples with the used CSKB. It can achieve a decent correlation with human judgments for dialogue commonsense evaluation and outperform end-to-end baselines. 
55 Automaticallyextractedevent-relationtuples Dialogue History: A: Cool. Are you religious? I have not told my parents I am wiccan yet. B: I am the wife of a pastor. Spirituality is important in our home. A: Oh. I grew up in a un religious home. B: God is always around, I hope he does not mind that I do not like cleaning. Target Response: A: I’m not religious. I’m a Christian. (PersonX not be religious, xAttr, PersonX be a Christian), (PersonX grew up in un religious home, xEffect, PersonX be a Christian), (PersonX like cleaning, isAfter, PersonX be a Christian), (PersonX like cleaning, HasSubEvent, PersonX be a Christian), (PersonX like cleaning, HinderedBy, PersonX be a Christian) Dialogue History: A: I’m dreadfully sorry for you. B: Thank you very much for your kindness. A: Is there any clue? Target Response: B: If I had a clue, I’d be happy to help. (PersonX have a clue, xAttr, PersonX be happy to help), (PersonX has a clue, xReact, PersonX be happy to help), (PersonX has a clue, xEffect, PersonX be happy to help), (PersonX asks PersonY for a clue, oEffect, PersonY is happy to help), (PersonX help PersonY, HinderedBy, PersonX have a clue) Dialogue History: A: Yeah I know, he did a great job for the limited time, can you believe that 20th Century Fox let George Lucas keep the merchandising rights, that’s what made him a billionaire, so crazy I don’t think they would do that ever again. B: I think I can believe it. Good for him, what does the corporation need with something they didn’t invent anyways? A: No I agree with you. Are you going to watch the Academy Awards by the way? They give out 24 Awards and I think that the first broadcast was back in 1930, which is wild B: Yeah, so like the academy awards totally suck. Does anyone actually give a bleep what they think? Target Response: A: Well I can’t speak for the other commenters but I’m sure there are people who give them a bleep can help winners go on to win awards and make money. (PersonX give a bleep, HasSubEvent, PersonX help winners go on to win awards) Table 5.10: Some examples of the automatically extracted event-relation tuples. These tuples function as the intermediate symbolic representation in ACCENT framework. 56 Chapter6 ControllableDialogueGenerationthroughAutomatic EvaluationMetrics In the previous chapters, we developed automatic dialogue evaluation metrics for assessing dialogues at both turn and conversation levels. We focused on different approaches of generating training data and model architectures to better train these metrics as quality classifiers. Conventionally, researchers resort to automatic evaluation metrics in the last developmental stage of the proposed dialogue models to test their success. While these metrics can be leveraged not only for testing but also for training dialogue systems. In this chapter, we examine the role of evaluation metrics in training controllable dialogue models. 6.1 ControllableDialogueGeneration Given the rapid development of natural language generation models and particularly the grand achievement of Large Language Models (LLMs) (Vaswani et al., 2017; Devlin et al., 2019; Zhang et al., 2020c; Chiang et al., 2023; Taori et al., 2023), there is a high demand to establish control over the generated outputs to contain particular attributes that meet specific constraints. 
To this end, Controllable Text Generation (CTG) models have emerged and been employed in a variety of tasks such as toxicity avoidance (Liu et al., 2021; Lu et al., 2022; Kim et al., 2023), sentiment control (Liu et al., 2021; Ziegler et al., 2019; Lu et al., 2022; Kim et al., 2023), topic control (Yang and Klein, 2021a; Kim et al., 2023), formality change in machine translation (Yang and Klein, 2021a), style transfer (Upadhyay et al., 2022), and topic-preserving generation of synthetic news (Mosallanezhad et al., 2021), among other tasks. However, Controllable Dialogue Generation (CDG), which aims to control conversational model outputs to be faithful, informative, relevant, and engaging, has received comparatively little attention (Hu et al., 2022; Daheim et al., 2023). These control factors help build dialogue systems with convincing performance that make the conversation a pleasant experience for humans. To steer dialogue models through different explicit or implicit signals and attributes, one approach is to train LLMs in an RL setup, using human feedback as rewards to impose the desired attributes and constraints. This approach, however, places an extra burden on humans for collecting that feedback (Bai et al., 2022). On the other hand, training the LLMs that serve as base models does not seem feasible due to their huge size and availability concerns. Hence, post-processing methods for CTG are better suited to LLMs: the model parameters are kept fixed and only the probability distributions over the tokens at each timestep are revised during the decoding stage (Yang and Klein, 2021b; Liu et al., 2021; Meng et al., 2022). Post-processing CTG requires fewer computational resources and does not need access to the parameters of the base model.
Figure 6.1: Generated responses by an LLM versus our proposed controllable generative model (CDAIF), which produces a more relevant and engaging response by altering the LLM's original probability distribution for the highlighted token, taking into account signals from automatic evaluation metrics as the control feedback.
On top of post-processing methods, we propose to leverage automatic evaluation metrics as artificial intelligence feedback to control dialogue models and reduce the workload on humans. Our goal is to examine and study Controllable Dialogue generation using Artificial Intelligence Feedback (CDAIF), which takes advantage of both LLMs as powerful generative models and cost-free control signals provided by automatic evaluation metrics on different aspects, without relying entirely on human feedback. We add engagement (See et al., 2019b; Ghazarian et al., 2020) and relevance (Ghazarian et al., 2019c, 2022b) as two indispensable control aspects at the decoding step, since our study of the quality of recent dialogue models shows that these qualities are often missing from the generated responses. Similar to (Bai et al., 2022), we probe artificial intelligence guidance for controlling dialogue systems. We focus on automatic evaluation metrics that have acceptable correlations with human judgments to modify LLM output distributions (see Figure 6.1), with no need to train the LLMs.
Our approach is applicable to huge and closed-source LLMs.
6.2 Constraints in Dialogue LLMs
With the advent of transformer-based language models (Devlin et al., 2019; Vaswani et al., 2017; Lewis et al., 2020), we have witnessed substantial progress in many NLP downstream tasks. This progress became even more tangible with the development of large language models such as GPT4, ChatGPT 1, LLaMA, and Flan-T5, which have billions of parameters and are trained on abundant text data (Zhang et al., 2020c; Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023; Chung et al., 2022). Although many of these large models, such as LLaMA (Touvron et al., 2023) and Flan-T5 (Chung et al., 2022), have been finetuned on conversational data and have resulted in superior dialogue systems, they are not flawless and possess deficiencies that should be addressed (Kaddour et al., 2023). In this work, we investigate three recently proposed and publicly released LLM-based chatbots, Vicuna (Chiang et al., 2023), Alpaca (Taori et al., 2023), and FastChat-T5 2, alongside the common issues manifested in their generations. Vicuna 3 and Alpaca 4 are both finetuned versions of the LLaMA model, trained on user-shared conversations collected from ShareGPT and on instruction-following demonstrations, respectively. They perform qualitatively similarly to their immense counterparts GPT3 and ChatGPT, while being much lighter and open-source. FastChat-T5 originates from Flan-T5 finetuned on user-shared conversations collected from ShareGPT.
To examine the quality of generations by the aforementioned dialogue models, we collect a dataset called DETA (Dialogue Error Type Dataset) that consists of 200 randomly selected dialogue histories from the Topical-Chat (Gopalakrishnan et al., 2019a) test set. Following each model's guidelines, we create prompts from at most six consecutive utterances chosen from different positions in the dialogues and pass them through the generative models. Overall, DETA contains 600 generated responses. DETA is annotated by three members of our lab who are familiar with the task. They check the quality of the generations by selecting the type of issue from a predefined set that includes engagement, relevance, common sense, repetition, factual hallucination, and inconsistency. These issues are marked when a response is not interesting and appealing, is irrelevant to the previous dialogue utterances, is not commonly sensible, repeats content at various levels, or contradicts its prior statements. If a response is admissible, annotators pick 'N', indicating 'No issue'. To compute the inter-annotator agreement between user pairs and the overall Krippendorff's alpha coefficient, we process the collected data and convert each user's annotation for each response into a binary list, where the label at each index indicates whether that specific type of issue is present. In the case of 'No issue', the list is all zeros. The Cohen's Kappa agreements between pairs of users are 0.52, 0.34, and 0.44, and Krippendorff's alpha coefficient is 0.43, showing moderate agreement given the complexity of the task. We then assign each issue to the corresponding generation if at least two annotators selected that issue. If a response is not marked with any type of issue, we categorize it as a "No issue" response.
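A rough sketch of this annotation aggregation is shown below, assuming each annotator's issue selections per response are stored as Python sets; the data layout, helper names, and the flattening used for Cohen's kappa are illustrative assumptions rather than the exact procedure.

```python
from sklearn.metrics import cohen_kappa_score

ISSUES = ["engagement", "relevance", "common sense", "repetition",
          "factual hallucination", "inconsistency"]

def to_binary(annotation: set[str]) -> list[int]:
    """Flatten one annotator's issue selection for one response into a binary list."""
    return [int(issue in annotation) for issue in ISSUES]

def pairwise_kappa(annotator_a: list[set], annotator_b: list[set]) -> float:
    """Cohen's kappa over the flattened binary labels of two annotators."""
    flat_a = [bit for ann in annotator_a for bit in to_binary(ann)]
    flat_b = [bit for ann in annotator_b for bit in to_binary(ann)]
    return cohen_kappa_score(flat_a, flat_b)

def majority_issues(annotations: list[set]) -> set:
    """Assign an issue to a response if at least two annotators selected it."""
    return {issue for issue in ISSUES
            if sum(issue in ann for ann in annotations) >= 2}

# Example: one response annotated by three lab members.
per_response = [{"engagement"}, {"engagement", "relevance"}, set()]
print(majority_issues(per_response))  # {'engagement'}
```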
The percentage of responses with and without issues generated by each 1 https://openai.com/blog/chatgpt 2 https://github.com/lm-sys/FastChat 3 https://lmsys.org/blog/2023-03-30-vicuna/ 4 https://crfm.stanford.edu/2023/03/13/alpaca.html 59 Figure 6.2: The percentage of responses with/without issue generated by Vicuna (left), Alpaca (middle), and FastChat-T5 (right) models. DialogueHistory GeneratedResponse Model Issue ... Person 1: I think I only saw that becuase my wife has a crush on him. I think he may have been shirtless on the trailer so my wife was all in on the movie. Did you like it? Person 2: I liked it, I thought it was a nice movie although many parts were unreal like when the volcano explodes and he suddenly can move again. Person 1: Yeah, that was a little far fetched. But overall, it was a good movie. Vicuna Eng ... Person 1: Elite performers do stop by, I don’t think any jazz musician can truly be considered a big name, they’re just not well known. Person 2: I couldn’t name a jazz musician. Jazz trumpeters are known to grow goatees, for comfort while playing. Person 1: Play the song that has the most votes. Alpaca Rel/Eng ... Person 1: Yes. There was once a president banana who was the 1st president of Zimbabwe! lol Person 2: Isn’t that a crazy name. Also, the president of Indonesia makes pop albums. Person 1: That’s something! You wouldn’t think he would have the time to do that while in office. Person 2: I know, Do you like to go to concerts? Person 1: I do! It’s a great way to spend the day. FastChat-T5 Eng Table 6.1: Examples of responses generated by Vicuna, Alpaca, and FastChat-T5 models alongside their type of issues indicated by majority of annotators. of three dialogue models are depicted in Figure 6.2. According to this figure, more than half of the generated responses by Alpaca and FastChat-T5 models have at least one of the predefined issues. We dive into more details and show the percentage of each model’s generations with their issue types. Figure 6.3 illustrates the severity of engagement insufficiency in all three models, followed by relevance as the second most common issue. Table 6.1 demonstrates three dialogue contexts alongside models’ responses and their specified issues. We concentrate on engagement and relevance as the two most substantial constraints and explore the effectiveness of leveraging automatic evaluation metrics to automatically calculate these aspects as the feedback used to guide LLMs. Post-processing methods that freeze the base generative models and re-rank the distribution of tokens in inference-time provide promising outputs specifically when base generative models are LLMs as we don’t need to deal with computationally expensive training procedures. We 60 Figure 6.3: Percentage of different type of issues existing in the responses generated by Vicuna, Alpaca, and FastChat-T5 models. establish our proposed CDAIF on NADO (Meng et al., 2022); a recently proposed decoding-time controllable text generative model that has shown its remarkable ability to merge the constraints in the generated outputs. NADO modifies the output probability distribution of base auto-regressive models by a trained neural model that is designed to conduct token-level guidance. Assume we have a dialogue generative model where its output is the probability distribution of response tokens given a dialogue context, and we plan to apply some predefined constraints on it to control the generations during decoding time. 
To achieve this goal, we leverage automatic evaluation metrics that measure the engagement and relevance of responses given the dialogue history and use them as a proxy to guide the dialogue model. Following NADO (Meng et al., 2022), we train a light sequence-to-sequence neural network that, given the utterance-level feedback provided by the automatic evaluation metrics, approximates the engagement and relevance of responses at the token level, since token-level scores are what is needed in the decoding stage to modify the LLM's token distributions. Similar to NADO (Meng et al., 2022), we prepare a large set of training data for CDAIF by passing sample responses generated by the base generative model to the automatic evaluation metrics and scoring them. We convert the metric scores to binary labels, setting 0.5 as the threshold. The boolean labels produced by the automatic evaluation metrics are indirectly assigned to all tokens at each timestep. The small seq-to-seq model is trained by minimizing the cross-entropy of this binary classification task at each step. The resulting lightweight classifier scores the engagement and relevance of response tokens at each timestep. During inference, the probability distribution of the fixed base model and that of the trained small model are combined to incorporate the embedded constraints in the output.
6.2.1 Model
In our experiments, we investigate Alpaca (Taori et al., 2023) with 7B parameters 5 and apply engagement and relevance labels to control its quality. Our approach can be easily generalized to other LLM-based dialogue models. When constructing the training data for the token-level classifier, we separate input conversations with 'Human' and 'Assistant' special tokens to mark the dialogue exchanged between interlocutors. We train CDAIF for ten epochs with a batch size of 32 and a maximum generation length of 1024 tokens. The controllable generative models trained with relevance and engagement scores reach cross-entropy losses of 2.27 and 2.80, respectively, and the model trained with the combined labels from both metrics has a cross-entropy loss of 4.79.
6.2.2 Data
We select dialogues from the Topical-Chat (Gopalakrishnan et al., 2019a) dataset to train and test the CDAIF model. In both the train and test sets, we extract at most four sequential utterances as dialogue contexts from random locations in the conversations. To train the model for token-level control, we expand the train set and generate four separate responses for each dialogue history. We then pass the generated outputs to the automatic evaluation metrics to obtain the sequence-level labels that are assigned to all tokens to guide the generation process. The final comparison of LLM quality versus CDAIF is conducted on 50 randomly selected dialogue contexts from the test set of the Topical-Chat dataset. We generate four groups of responses: the outputs of the Alpaca model as the base LLM, the CDAIF(eng) model trained on engagement labels, the CDAIF(rel) model that uses relevance scores as the control signals, and the CDAIF(rel_eng) model that combines engagement and relevance labels as a more restrictive guidance. We ask annotators to rate the randomly sorted outputs of Alpaca and CDAIF(eng) in the range [1-5], with 1 indicating completely boring responses and 5 completely engaging generations. Annotators are also asked to rate the relatedness degree of responses generated by Alpaca and CDAIF(rel) in the same range [1-5].
In the last group of annotations, we look for the overall quality of Alpaca and CDAIF(rel_eng) models. 6.2.3 AutomaticEvaluationFeedback To examine the engagement and relevance of the samples generated by the base model and use them for training CDAIF, we leverage off-the-shelf transformer-based classifiers (PredictiveEn- gagement (Ghazarian et al., 2020) and DEAM (Ghazarian et al., 2022b)) that have acceptable correlations with human judgments. Since these metrics are used to judge the relevance and engagement of the generated samples leveraged for training token-level guidance, their reliability is important. To this end, we divide DETA into three parts and we use its test portion to assess the quality of the scores predicted by the existing pretrained metrics. Table 6.2 shows the computed correlation between the binary 5 We download Alpaca from https://huggingface.co/chavinlo/alpaca-native. 62 Stage Metric Correlation (Pre) PredictiveEngagement Engagement 0.13 (p=0.21) (Pre) DEAM relevance -0.03 (p=0.76) (Ft) PredictiveEngagement Engagement 0.32 (p=1e-3) (Ft) DEAM relevance 0.50 (p=2e-7) Table 6.2: Pearson Correlations of pretrained (Pre) and finetuned (Ft) automatic evaluation metrics on the test set of DETA. Figure 6.4: The percentage of generated responses by Alpaca and CDAIF models controlled by engagement, relevance and both constraints that have higher and/or equal quality. list of human-annotated and metrics-predicted scores. The list of human scores consists of 0s if at least two annotators have marked the engagement or relevance issue for the corresponding response and 1 otherwise. The list of metrics ratings is constructed by mapping scores in the range of [0-1] to 0 if the values are less than 0.5 and 1 otherwise. The first section in Table 6.2 represents the low performance of pretrained PredictiveEngage- ment and DEAM metrics with respect to human annotations. We finetune each of these metrics on the 400 train portion of DETA. We train PredictiveEngagement for 200 epochs with a batch size of 20 and use the RMSprop optimizer. The learning rate is 1e-4. We halt the training via early stopping of 50 epochs. We train DEAM for 25 epochs with a batch size of 8, a learning rate of 2e-5, and set AdamW as the optimizer. The higher performance of finetuned automatic evaluation metrics shown in the second part of Table 6.2 encourages us to use them further along the training of CDAIF. 6.3 ResultsandAnalysis To examine the effectiveness of CDAIF using automatic evaluation metrics as the sources of feedback for each pair of responses generated by Alpaca and CDAIF models, we select those 63 with higher user ratings as the winners. If both responses have equal scores, we choose ’both’ as the winner. Then, the overall winner for each pair is the one that is preferred by at least two annotators. Figure 6.4 shows the percentage of responses that are rated higher by at least two of the annotators. The left pie chart includes the comparison between Alpaca and CDAIF(rel) models, the middle and right pie charts are regarding CDAIF(eng) and CDAIF(rel_eng). In Figure 6.4, we observe that in all three pie charts, the percentage of responses generated by conditional generative models (CDAIF) with higher human ratings is more than those generated by Alpaca. According to the left figure, human has mostly selected CDAIF(rel) model’s outputs as more relevant to the dialogue context. 
6.3 Results and Analysis

To examine the effectiveness of CDAIF with automatic evaluation metrics as the source of feedback, for each pair of responses generated by Alpaca and a CDAIF model, each annotator selects the response with the higher rating as the winner; if both responses receive equal scores, 'both' is chosen as the winner. The overall winner for each pair is then the response preferred by at least two annotators. Figure 6.4 shows the percentage of responses that are rated higher by at least two of the annotators. The left pie chart compares Alpaca and CDAIF(rel), while the middle and right pie charts correspond to CDAIF(eng) and CDAIF(rel_eng), respectively.

Figure 6.4: The percentage of generated responses by Alpaca and CDAIF models controlled by engagement, relevance, and both constraints that have higher and/or equal quality.

In Figure 6.4, we observe that in all three pie charts the percentage of responses generated by the conditional generative models (CDAIF) with higher human ratings exceeds that of the responses generated by Alpaca. According to the left chart, annotators mostly selected the CDAIF(rel) model's outputs as more relevant to the dialogue context. The middle pie chart shows that the model controlled with the engagement score (CDAIF(eng)) generates more engaging responses than the base Alpaca model. The percentage of CDAIF(rel_eng) responses with overall higher quality than the base Alpaca model is much larger, showing that combining the engagement and relevance aspects enhances the overall quality of generation by steering the outputs to be relevant and, at the same time, engaging. In Table 6.3, we qualitatively show examples of generations by Alpaca versus CDAIF refined by engagement, relevance, and both aspects. For each pair of responses generated for the same dialogue context, the winner is the response preferred by the majority of annotators according to their ratings.
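As an illustration of this aggregation, the toy snippet below compares per-annotator ratings for one response pair and applies the at-least-two-annotators rule. The score values and the handling of a three-way split are hypothetical; they are not taken from the actual annotation files.

```python
from collections import Counter

def pair_winner(alpaca_scores, cdaif_scores):
    """Aggregate per-annotator [1-5] ratings for one (Alpaca, CDAIF) response pair."""
    votes = []
    for a, c in zip(alpaca_scores, cdaif_scores):
        if a > c:
            votes.append("Alpaca")
        elif c > a:
            votes.append("CDAIF")
        else:
            votes.append("both")   # equal scores count as a tie vote
    # The overall winner is the option preferred by at least two annotators.
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "no majority"  # three-way split handling is an assumption

print(pair_winner([3, 2, 4], [4, 4, 4]))  # -> "CDAIF"
```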
6.4 Conclusion

Even though the impressive performance of LLMs has brought about a remarkable shift in natural language generation, including dialogue generation, it is still imperative to exert control over LLMs along different dimensions, such as relevance and engagement, in order to conduct conversations that are well aligned with human preferences. In this chapter, we presented CDAIF, which takes advantage of efficient LLMs as base generative models and leverages automatic evaluation metrics as low-cost feedback to guide generation, relying less on human judgments. We showed improvements in the quality of the generated outputs of models controlled along these different dimensions.

Pair 1: Alpaca vs. CDAIF(rel)
Dialogue history: ... Person 1: yeah, people love him so he gets cast a lot. i suppose my favorite film of his is shawshank. that was great. which film of his do you like? Person 2: OMG! Shawshank is my favorite. Loved his character Red.
Alpaca: Person 1: Did you know that?
CDAIF(rel): Person 1: yeah, he's great in that role. what other films of his do you like?

Pair 2: Alpaca vs. CDAIF(eng)
Dialogue history: ... Person 1: Haha I think you're right about that! What does make me sad though is that cheerleaders are not treated well in the NFL ... Person 2: Yeah, there is still a lot of sexism in the league. some things change very slowly, it seems.
Alpaca: Person 1: Yeah I'm sure it will be interesting to see what happens next.
CDAIF(eng): Person 1: Yeah it's sad because they are just as important as the players and should be treated as such. I guess that's why the NFL is losing viewers because of these kinds of things. What do you think about the college football scene?

Pair 3: Alpaca vs. CDAIF(rel_eng)
Dialogue history: ... Person 1: Ha ha, that's hilarious. It definitely helps you to know what's going on. Were you also aware that there is a RB that never fumbled the ball in his career? Person 2: Yeah, he plays for the Patriots too. Probably contributes to their ongoing success.
Alpaca: Person 1: You know what's going on. That's pretty cool. I can't wait to see what happens next.
CDAIF(rel_eng): Person 1: Yeah, I heard about that! He is really impressive. Did you know that the most memorable Super Bowl moment was when the Eagles won the Super Bowl in 2018 with a last-second FG?

Table 6.3: Pairs of responses generated by Alpaca and the CDAIF(rel), CDAIF(eng), or CDAIF(rel_eng) models for the same dialogue context. The winner responses are the ones preferred by the majority of annotators.

Chapter 7
Related Work

We discuss related work on the automatic evaluation of open-domain dialogue systems with respect to model architectures and training data construction approaches. We particularly review previous research on the engagement, relevance, coherence, and commonsense aspects, and we conclude the chapter by reviewing previous work on controllable text generation.

7.1 Relevance

Compared to n-gram overlap-based metrics, which perform poorly in the open-domain setting (Liu et al., 2016), trainable automatic evaluation metrics correlate better with human judgments. Lowe et al. (2017) train a hierarchical neural network called the Automatic Dialogue Evaluation Model (ADEM) on human-annotated data to predict the appropriateness of dialogue responses. RUBER (Tao et al., 2018a), in contrast, presents an unreferenced metric trained on automatically generated negative samples obtained through random matching; it combines the unreferenced relevance score with a referenced score computed from the semantic similarity between the generated and reference responses. More recently, Mehri and Eskenazi (2020) proposed an unsupervised metric based on DialoGPT (Zhang et al., 2020c) that evaluates dialogue quality along eighteen fine-grained qualities, including relevance. They define a set of predefined positive (e.g., "Wow! That's really cool!") and negative (e.g., "That's not even related to what I said") follow-up utterances for each quality and then compute the likelihood of DialoGPT generating these follow-up utterances.

7.2 Engagement

Engagement estimation has been addressed in many spoken dialogue systems based on a listener's multimodal behavior or the acoustic features of conversations (Yu et al., 2004; Inoue et al., 2018). Heuristic measures of engagement have also been proposed by many researchers (Venkatesh et al., 2018; Khatri et al., 2018; Ghandeharioun et al., 2019). In the Alexa Prize competition, the engagement score of a dialogue system is calculated from the number of turns and the total duration of the conversation (Venkatesh et al., 2018; Khatri et al., 2018). Ghandeharioun et al. (2019) consider a dialogue system engaging when it asks questions during a conversation and generates longer responses. All these conditions are necessary for an engaging conversation but are not sufficient. Mehri and Eskenazi (2020) also include engagement as one of their eighteen fine-grained assessment qualities. Yi et al. (2019) apply automatic evaluation metrics to enhance the quality of responses generated by dialogue systems. They do not directly train a model to predict an engagement score; instead, they ask annotators about interestingness and willingness to continue the conversation and use the answers to these two questions as a proxy for engagement, which requires additional human annotations.

7.3 Coherence

Dialogue coherence evaluation is closely related to discourse coherence, since a dialogue can be viewed as a multi-party discourse.
Similar to discourse coherence metrics, many early coherence evaluation metrics were derived from the Centering Model, which tracks the local focus of utterances and the distribution of their entities (Grosz and Sidner, 1986; Miltsakaki and Kukich, 2004; Lapata and Barzilay, 2005). A group of studies assesses the coherence of dialogues with respect to entities and dialogue acts (Cervone and Riccardi, 2020; Mesgar et al., 2020). Another line of work represents a dialogue as a structured graph in which contextually dependent neighboring utterances or concepts are connected nodes, and applies graph convolutional networks to this representation (Vakulenko et al., 2018; Mesgar et al., 2020; Huang et al., 2020a). A high-quality training dataset is one of the key components of automatic coherence evaluation. Some previous works construct such datasets by collecting human judgments (Higashinaka et al., 2014; Cervone and Riccardi, 2020), while many recent works rely on a faster and more affordable approach that automatically generates negative samples: the utterances of coherent conversations from human-human interactions are manipulated by shuffling their order or by inserting or replacing irrelevant utterances (Vakulenko et al., 2018; Mesgar et al., 2020; Huang et al., 2020a; Zhang et al., 2021).

7.4 Commonsense

The evaluation of open-domain dialogue systems has long been a challenge due to the systems' open-ended goal (Huang et al., 2020b), and simply scoring overall quality is far from enough (Finch and Choi, 2020). Researchers have therefore decomposed the evaluation of open-domain dialogues into multiple facets and developed corresponding automatic evaluation metrics (Pang et al., 2020; Mehri and Eskenazi, 2020). Among all these aspects, commonsense has long been neglected. The closest related work on commonsense evaluation is Zhou et al. (2021b), which mainly focuses on collecting commonsense-focused dialogues but also proposes an automatic metric for commonsense evaluation by training an MLP regressor on both symbolic and neural features. The symbolic features include the numbers of one-hop and two-hop triplets in ConceptNet (Speer et al., 2017) that can be found between the target response and its dialogue history. In the following subsections, we discuss previous work on event-relation extraction and CSKB compatibility, the two main components of our proposed ACCENT metric (see Section 5.2.10).

7.4.1 Joint Event-Relation Extraction

While event extraction and relation identification are well studied, how to acquire them jointly remains a challenge. We argue that joint event-relation extraction is an important problem because, in practical use cases, the input is usually free-form text without pre-extracted events. Zhang et al. (2020a) is pioneering work that jointly extracts events and relations through a pipeline in order to automatically construct large knowledge graphs; it resorts to rule-based methods for event extraction and trains a classifier to predict the relation between a pair of events.

7.4.2 CSKB Compatibility

CSKB population automatically enlarges a CSKB by adding new links or nodes that are compatible with the commonsense knowledge already in the CSKB. In Fang et al. (2021a,b), researchers add events from a large event knowledge graph to a CSKB.
The compatibility test component of ACCENT is related to the CSKB population task, but it is defined in a more general setting where the head and tail of a tuple can be arbitrary events.

7.5 Controllable Generation

In many downstream generation tasks, we are interested in models that generate text with an emphasis on specific attributes, such as positive sentiment (Ziegler et al., 2019; Lu et al., 2022), a low likelihood of toxic content (Kim et al., 2023; Lu et al., 2022), or preserved content and style (Upadhyay et al., 2022). Controllable text generation (CTG) models have emerged for this purpose. The majority of proposed CTG models rely on a reinforcement learning (RL) architecture in which the control aspects are treated as rewards to optimize the policy. These rewards are obtained through human intervention, with human preferences used as gold labels in a supervised learning setup (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022). CTG can also be applied in more complex settings such as dialogue generation, where a chatbot's outputs should be controlled to be of high quality. Hu et al. (2022) propose a controllable dialogue generation (CDG) model that guides response generation via attributes of different granularities, including specificity, sentiment, response-relatedness, question-asking, and response length. All these attributes are specified automatically through heuristic methods, such as Normalized Inverse Document Frequency (NIDF) (See et al., 2019b) for identifying the specificity and engagingness of outputs, and the cosine similarity between embeddings of the response and the last utterance of the dialogue to estimate the degree of relevance. Such heuristic-based models do not suffer from the temporal and financial limitations of human feedback. Following the idea of reducing the burden on humans in CTG training, Bai et al. (2022) shift RL-based CTG toward using Artificial Intelligence Feedback, known as RLAIF. To control the harmfulness of generated responses, they start with a set of generations that may include harmful and toxic content, repeatedly revise those generations using a limited set of human-provided critiques, and finetune an LM on the final revised responses. The finetuned model is then used to generate pairs of responses for harmful prompts, yielding a dataset for training a preference model that serves as the reward function in the RL setup. Among generative models, LLMs (Zhang et al., 2020c; Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023; Chung et al., 2022), with their proven excellence in a wide range of tasks, are suitable choices for CTG. However, their massive size and restricted public availability make RL-based CTG impractical. Decoding-time post-processing CTG, which freezes the generative model and modifies its token distribution, is an alternative solution. FUDGE (Yang and Klein, 2021b) trains a binary predictor that, at each time step, estimates whether the constraint will be satisfied if the generation is completed; the base LM's output is modified by multiplying the trained model's probabilities with the original LM's output probabilities. DEXPERT (Liu et al., 2021) suggests another way of modifying the LM output distribution: adding the probability difference between an "expert" and an "anti-expert" LM that are trained on samples with and without the specific attribute, respectively.
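To make the contrast concrete, the snippet below sketches the two combination rules in schematic form; it is an illustrative simplification under stated assumptions (the logits and probabilities are supplied by the caller), not the authors' released implementations.

```python
import torch

def fudge_combine(lm_logits, satisfy_prob, alpha=1.0):
    # FUDGE-style: multiply the LM probability by the predictor's probability that
    # the constraint will eventually be satisfied (i.e., add log-probabilities).
    return torch.log_softmax(lm_logits, dim=-1) + alpha * torch.log(satisfy_prob.clamp_min(1e-8))

def dexpert_combine(lm_logits, expert_logits, anti_expert_logits, alpha=1.0):
    # DEXPERT-style: shift the base logits by the difference between an "expert" LM
    # (trained on attribute-positive text) and an "anti-expert" LM (attribute-negative text).
    return lm_logits + alpha * (expert_logits - anti_expert_logits)
```

In both cases, the next token is then chosen from a softmax over the combined scores.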
Both DEXPERT and FUDGE depend on token-level oracle guidance or on extra labeled datasets for training their auxiliary models, and those datasets usually follow a different distribution from the data the base LM was trained on. NADO (Meng et al., 2022) does not have these limitations: it trains a neural network on data generated by the base model itself to decompose a sequence-level feedback function into token-level guidance that alters the LM's output distributions.

Chapter 8
Discussion

8.1 Conclusion

We have made the following contributions to the automatic evaluation of open-domain dialogue systems:

• In Chapter 2, we discussed the relevance evaluation of open-domain dialogue systems using contrastive learning. We explored the efficiency of contextualized word embeddings, specifically BERT embeddings (Devlin et al., 2019), in the training of automatic evaluation metrics. We applied different network architectures and objective functions to better utilize contextualized word embeddings and showed their positive effects.

• In Chapter 3, we investigated the possibility and efficacy of estimating utterance-level engagement and defined a novel metric, predictive engagement, for the automatic evaluation of open-domain dialogue systems. Our experiments demonstrated that (1) human annotators agree highly when assessing utterance-level engagement scores, and (2) conversation-level engagement scores can be predicted from properly aggregated utterance-level engagement scores. Furthermore, we showed that utterance-level engagement scores can be learned from data. To this end, we proposed a simple heuristic of applying conversation-level engagement scores to all utterances and using them as ground-truth labels for learning predictive engagement. By combining engagement and relevance scores, we obtained a more accurate evaluation metric with improved correlation to human judgments.

• In Chapter 4, we proposed semantic-level manipulations for generating incoherent (negative) data for training dialogue coherence evaluation metrics. We proposed injecting various types of incoherence sources, such as coreference inconsistency, irrelevancy, contradiction, and decreased engagement, into the AMRs of dialogues. We showed that such semantic-level manipulations result in more natural incoherent samples of the kind that state-of-the-art dialogue models are likely to produce. Compared to baseline methods, our proposed metric demonstrated higher correlations with human judgments on several dialogue datasets by significant margins.

• In Chapter 5, we discussed the importance and complexity of commonsense evaluation in dialogue assessment. To the best of our knowledge, we are the first to systematically study commonsense in dialogue systems. More specifically, we focused on event commonsense, which concerns commonsense knowledge about events and their relations. We proposed a pipeline method that uses an intermediate symbolic representation for commonsense reasoning. Our metric first extracts event-relation tuples from the target response and its preceding dialogue history via a prompt-based generative model trained in a low-resource setting. Those extracted tuples bridge the gap between free-form dialogue and the compact form of the CSKB. Then, a compatibility score is computed to decide how well each extracted tuple aligns with the CSKB.
Through extensive experiments, we showed that our proposed metric achieved better correlation with human judgments for dialogue commonsense evaluation than several well-designed baselines and enabled easier interpretation of the results. We also showed the superior performance of the overall metric and of each proposed component (event-relation extraction and the compatibility test) under different experimental setups.

• In Chapter 6, we examined the use of automatic evaluation metrics for the generation task. In particular, we showed an avenue for decoding-time Controllable Dialogue Generation (CDG) in which the control signals are provided by automatic evaluation metrics and the base generative models are LLMs. The evaluation metric scores re-weight the probability distribution of output tokens and steer the LLMs toward the constraints without touching the LLMs' parameters. We demonstrated the superiority of our proposed CDG, guided by Artificial Intelligence Feedback from automatic evaluation metrics, over the original LLMs.

8.2 Future Directions

With the emergence of new dialogue LLMs, the role of automatic evaluation metrics becomes even more prominent. More research is needed in automatic dialogue evaluation before we can rely directly on these metrics and reduce the burden on humans. As discussed in Chapter 4, the quality of the generated negative samples used to train evaluation metrics is important for obtaining more reliable evaluators. It is important to automatically discover the types of issues that occur in dialogue models and apply relevant, intelligent perturbations that replicate those errors. Given the remarkable abilities of LLMs, such perturbations could even be produced with the help of these models by designing suitable prompts that guide them to create negative samples.

Our proposed framework in Chapter 5 is based on a fixed set of event relations from the commonsense knowledge in ATOMIC 2020, which may fail to cover some potential event commonsense aspects. We believe augmenting the current framework with more commonsense resources is a worthwhile direction for further improvement. In addition, the event-relation extractor in our proposed metric is a T5 model fine-tuned in a low-resource setting. Although the current model yields fairly strong performance, improving the joint event-relation extraction component remains an important research direction, because the extracted tuples serve as the intermediate symbolic representation for commonsense reasoning in our framework. Since human-extracted tuples are very costly to collect, it is worth exploring improvements to this component through high-quality synthetic data construction or transfer learning.

In Chapters 2, 3, and 5 we studied turn-level automatic evaluation metrics, while in Chapter 4 we examined conversation-level metrics. Obviously, not all turns are equally important when evaluating conversation-level qualities; for example, greeting utterances are usually not rich in content and therefore should not have a high impact on the conversation's assessment. Constructing conversation-level metrics on top of turn-level metrics using attention techniques is a promising direction.

Bibliography

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473. URL: http://arxiv.org/abs/ 1409.0473, arXiv:1409.0473. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073. Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. InProceedingsofthe7thLinguisticAnnotationWorkshopandInteroperabilitywith Discourse, LAW-ID@ACL 2013, August 8-9, 2013, Sofia, Bulgaria , pages 178–186. The Association for Computer Linguistics. Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang, Wenquan Wu, Zhen Guo, Zhibin Liu, and Xinchao Xu. 2021. PLATO-2: Towards building an open-domain chatbot via curriculum learning. In FindingsoftheAssociation forComputationalLinguistics: ACL-IJCNLP2021, pages 2513–2525, Online. Association for Computational Linguistics. URL: https://aclanthology. org/2021.findings-acl.222. Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics. URL: https: //aclanthology.org/P19-1470. Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. InProceedingsofthe2017ConferenceonEmpiricalMethods inNaturalLanguageProcessing, pages 1442–1451. Association for Computational Linguistics. URL: http://aclweb.org/anthology/D17-1151. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020). 73 Donna K. Byron and Amanda Stent. 1998. A preliminary model of centering in dialog. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL ’98, August 10-14, 1998, Université de Montréal,Montréal,Quebec,Canada.ProceedingsoftheConference, pages 1475–1477. Morgan Kaufmann Publishers / ACL. Alessandra Cervone and Giuseppe Riccardi. 2020. Is this dialogue coherent? learning from dialogue acts and entities. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGdial 2020, 1st virtual meeting, July 1-3, 2020, pages 162–174. Association for Computational Linguistics. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. Seehttps://vicuna.lmsys.org(accessed14 April 2023). Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In31stConferenceonnNeuralInformation Processing Systems (NeurIPS 2017). Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic, Shang-Wen Li, Scott Yih, Yoon Kim, and James Glass. 2022. 
DiffCSE: Difference-based contrastive learning for sentence embeddings. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4207–4218, Seattle, United States. Association for Computational Linguistics. URL: https: //aclanthology.org/2022.naacl-main.311. Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416. Nico Daheim, Nouha Dziri, Mrinmaya Sachan, Iryna Gurevych, and Edoardo M Ponti. 2023. Elastic weight removal for faithful and abstractive dialogue generation. arXivpreprintarXiv:2303.17574. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805. URL: http://arxiv.org/abs/1810.04805, arXiv:1810.04805. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedingsofthe2019Conference oftheNorthAmericanChapteroftheAssociationforComputationalLinguistics: HumanLanguage Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. URL:https://aclanthology.org/N19-1423. Birk Diedenhofen and Jochen Musch. 2015. cocor: A comprehensive solution for the statistical comparison of correlations. PloS one. Nouha Dziri, Ehsan Kamalloo, Kory W Mathewson, and Osmar Zaiane. 2018. Augmenting neural response generation with context-aware topical attention. arXiv preprint arXiv:1811.01063. 74 Tianqing Fang, Weiqi Wang, Sehyun Choi, Shibo Hao, Hongming Zhang, Yangqiu Song, and Bin He. 2021a. Benchmarking commonsense knowledge base population with an effective evaluation dataset. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8949–8964, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. URL:https://aclanthology.org/2021.emnlp-main.705. Tianqing Fang, Hongming Zhang, Weiqi Wang, Yangqiu Song, and Bin He. 2021b. Discos: Bridging the gap between discourse knowledge and commonsense knowledge. In Proceedings of the Web Conference 2021, pages 2648–2659. Sarah E. Finch and Jinho D. Choi. 2020. Towards unified dialogue system evaluation: A compre- hensive analysis of current evaluation protocols. InProceedingsofthe21thAnnualMeetingofthe SpecialInterestGrouponDiscourseandDialogue, pages 236–245, 1st virtual meeting. Association for Computational Linguistics. URL:https://aclanthology.org/2020.sigdial-1.29. Brett M Frischmann. 2021. Common sense commons: The case of commonsensical social norms. Available at SSRN 3781955. Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural LanguageProcessing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. URL:https://aclanthology.org/2021.emnlp-main.552. Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Àgata Lapedriza, and Rosalind W. Picard. 2019. Approximating interactive human evaluation with self-play for open-domain dialog systems. CoRR. Sarik Ghazarian, Behnam Hedayatnia, Alexandros Papangelis, Yang Liu, and Dilek Hakkani-Tur. 
2022a. What is wrong with you?: Leveraging user sentiment for automatic dialog evaluation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4194–4204, Dublin, Ireland. Association for Computational Linguistics. URL: https://aclanthology.org/2022. findings-acl.331. Sarik Ghazarian, Zixi Liu, Tuhin Chakrabarty, Xuezhe Ma, Aram Galstyan, and Nanyun Peng. 2021. Discol: Toward engaging dialogue systems through conversational line guided response generation. InProceedingsofthe2021ConferenceoftheNorthAmericanChapteroftheAssociation for Computational Linguistics: Human Language Technologies: Demonstrations, NAACL-HLT 2021, Online, June 6-11, 2021, pages 26–34. Association for Computational Linguistics. Sarik Ghazarian, Yijia Shao, Rujun Han, Aram Galstyan, and Nanyun Peng. 2023. ACCENT: An automatic event commonsense evaluation metric for open-domain dialogue systems. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4398–4419, Toronto, Canada. Association for Computational Linguistics. URL: https://aclanthology.org/2023.acl-long.241. Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019a. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. CoRR, abs/1904.10635. 75 Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019b. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In NeuralGen 2019. Sarik Ghazarian, Johnny Tian-Zheng Wei, Aram Galstyan, and Nanyun Peng. 2019c. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. CoRR, abs/1904.10635. Sarik Ghazarian, Ralph M. Weischedel, Aram Galstyan, and Nanyun Peng. 2020. Predictive engagement: An efficient metric for automatic evaluation of open-domain dialogue systems. InTheThirty-FourthAAAIConferenceonArtificialIntelligence,AAAI2020,TheThirty-Second InnovativeApplicationsofArtificialIntelligenceConference,IAAI2020,TheTenthAAAISymposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7789–7796. AAAI Press. Sarik Ghazarian, Nuan Wen, Aram Galstyan, and Nanyun Peng. 2022b. DEAM: Dialogue coherence evaluation using AMR-based semantic manipulations. InProceedingsofthe60thAnnualMeeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 771–785, Dublin, Ireland. Association for Computational Linguistics. URL: https://aclanthology.org/2022. acl-long.57. Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwa- tra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019a. Topical-chat: Towards knowledge-grounded open-domain conversations. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 1891–1895. ISCA. Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019b. Topical-Chat: Towards Knowledge- Grounded Open-Domain Conversations. In Proc. Interspeech 2019, pages 1891–1895. URL: http://dx.doi.org/10.21437/Interspeech.2019-3079. H. Paul Grice. 1975. Logic and conversation. In Peter Cole and Jerry L. Morgan, editors,Speech Acts, volume 3 of Syntax and Semantics, pages 41–58. Academic Press, New York. Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. 
Comput. Linguistics, 12(3):175–204. Chulaka Gunasekara, Seokhwan Kim, Luis Fernando D’Haro, Abhinav Rastogi, Yun-Nung Chen, Mihail Eric, Behnam Hedayatnia, Karthik Gopalakrishnan, Yang Liu, Chao-Wei Huang, et al. 2020. Overview of the ninth dialog system technology challenge: Dstc9. CoRR, abs/2011.06486. Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, and Ashwin Ram. 2018. Topic-based evaluation for conversational bots. CoRR. Ryuichiro Higashinaka, Toyomi Meguro, Kenji Imamura, Hiroaki Sugiyama, Toshiro Makino, and Yoshihiro Matsuo. 2014. Evaluating coherence in open domain conversational systems. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 130–134. ISCA. 76 Zhe Hu, Zhiwei Cao, Hou Pong Chan, Jiachen Liu, Xinyan Xiao, Jinsong Su, and Hua Wu. 2022. Controllable dialogue generation with disentangled multi-grained style specification and attribute consistency reward. IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing, 31:188–199. Lishan Huang, Zheng Ye, Jinghui Qin, Liang Lin, and Xiaodan Liang. 2020a. GRADE: automatic graph-enhanced coherence metric for evaluating open-domain dialogue systems. InProceedings ofthe2020ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,EMNLP2020,Online, November 16-20, 2020, pages 9230–9240. Association for Computational Linguistics. Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020b. Challenges in building intelligent open- domain dialog systems. ACM Transactions on Information Systems (TOIS), 38(3):1–32. Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021. Comet-atomic 2020: On symbolic and neural commonsense knowledge graphs. In AAAI. Koji Inoue, Divesh Lala, Katsuya Takanashi, and Tatsuya Kawahara. 2018. Engagement recogni- tion in spoken dialogue via neural network by aggregating different annotators’ models. In Interspeech. Lisa Jin and Daniel Gildea. 2019. Amr-to-text generation with cache transition systems. CoRR, abs/1912.01682. Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169. Anjuli Kannan and Oriol Vinyals. 2017. Adversarial evaluation of dialogue models. CoRR, abs/1701.08198. URL: http://arxiv.org/abs/1701.08198, arXiv:1701.08198. Chandra Khatri, Anu Venkatesh, Behnam Hedayatnia, Ashwin Ram, Raefer Gabriel, and Rohit Prasad. 2018. Alexa prize-state of the art in conversational ai. AI Magazine. Minbeom Kim, Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, and Kyomin Jung. 2023. Critic-guided decoding for controlled text generation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4598–4612. Association for Computational Linguistics. URL: https://aclanthology.org/2023.findings-acl.281.pdf. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd InternationalConferenceonLearningRepresentations,ICLR2015,SanDiego,CA,USA,May7-9, 2015, Conference Track Proceedings. URL: http://arxiv.org/abs/1412.6980. Paul R. Kingsbury and Martha Palmer. 2002. From treebank to propbank. In Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, May 29-31, 2002, Las Palmas, Canary Islands, Spain. European Language Resources Association. Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. 
OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL. URL: https://doi.org/10. 18653/v1/P17-4012. 77 Hoang Thanh Lam, Gabriele Picco, Yufang Hou, Young-Suk Lee, Lam M. Nguyen, Dzung T. Phan, Vanessa López, and Ramón Fernandez Astudillo. 2021. Ensembling graph predictions for AMR parsing. CoRR, abs/2110.09131. Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. InIJCAI-05,ProceedingsoftheNineteenthInternationalJointConference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005 , pages 1085–1090. Professional Book Center. Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics. Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 110–119. URL:http://aclweb. org/anthology/N/N16/N16-1014.pdf. Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017a. Dailydialog: A manually labelled multi-turn dialogue dataset. InProceedingsoftheEighthInternationalJoint ConferenceonNaturalLanguageProcessing,IJCNLP2017,Taipei,Taiwan,November27-December 1, 2017 - Volume1: LongPapers, pages 986–995. URL: https://aclanthology.info/papers/ I17-1099/i17-1099. Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017b. Dailydialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International JointConferenceonNaturalLanguageProcessing,IJCNLP2017,Taipei,Taiwan,November27- December1,2017-Volume1: LongPapers, pages 986–995. Asian Federation of Natural Language Processing. Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017c. DailyDialog: A manually labelled multi-turn dialogue dataset. InProceedingsoftheEighthInternationalJoint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing. URL:https://aclanthology.org/ I17-1099. Kexin Liao, Logan Lebanoff, and Fei Liu. 2018. Abstract meaning representation for multi-document summarization. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1178–1190. Association for Computational Linguistics. Chin-Yew Lin. 2004. Rouge: a package for automatic evaluation of sum- maries. URL: https://www.microsoft.com/en-us/research/publication/ rouge-a-package-for-automatic-evaluation-of-summaries/. 78 Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online. Association for Computational Linguistics. 
URL: https://aclanthology.org/2021.acl-long.522. Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsuper- vised evaluation metrics for dialogue response generation. InProceedingsofthe2016Confer- ence on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2122–2132. The Association for Computational Linguistics. URL: http://aclweb.org/anthology/D/D16/D16-1230.pdf. Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. CoRR, abs/1903.08855. URL: http://arxiv.org/abs/1903.08855, arXiv:1903.08855. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Ro{bert}a: A robustly optimized {bert} pretraining approach. URL:https://openreview.net/forum?id=SyxS0T4tvS. Varvara Logacheva, Mikhail Burtsev, Valentin Malykh, Vadim Poluliakh, Alexander Rudnicky, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W Black, and Yoshua Bengio. 2018. A dataset of topic-oriented human-to-chatbot dialogues. Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1116–1126. Association for Computational Linguistics. URL:https://doi.org/10.18653/v1/ P17-1103. Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the SIGDIAL 2015 Conference, The 16th Annual Meeting of the Special Interest Group on Discourse andDialogue,2-4September2015,Prague,CzechRepublic, pages 285–294. The Association for Computer Linguistics. Ximing Lu, Sean Welleck, Jack Hessel, Jiang Liwei, Lianhui Qin, Peter West, Prithviraj Am- manabrolu, and Yejin Choi. 2022. Quark: Controllable text generation with reinforced unlearning. In 36th Conference onn Neural Information Processing Systems (NeurIPS 2022). Xiaojuan Ma. 2018. Towards human-engaged ai. In IJCAI. Manuel Mager, Ramón Fernandez Astudillo, Tahira Naseem, Md. Arafat Sultan, Young-Suk Lee, Radu Florian, and Salim Roukos. 2020. Gpt-too: A language-model-first approach for amr-to- text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational 79 Linguistics, ACL 2020, Online, July 5-10, 2020, pages 1846–1852. Association for Computational Linguistics. John McCarthy and Patrick J Hayes. 1981. Some philosophical problems from the standpoint of artificial intelligence. In Readings in artificial intelligence , pages 431–450. Elsevier. Sheila A. McIlraith and Kilian Q. Weinberger, editors. 2018. Proceedings of the Thirty-Second AAAIConferenceonArtificialIntelligence,(AAAI-18),the30thinnovativeApplicationsofArtifi- cialIntelligence(IAAI-18),andthe8thAAAISymposiumonEducationalAdvancesinArtificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/schedConf/presentations. Shikib Mehri and Maxine Eskénazi. 2020. Unsupervised evaluation of interactive dialog with dialogpt. 
In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse andDialogue,SIGdial2020,1stvirtualmeeting,July1-3,2020, pages 225–235. Association for Computational Linguistics. Shikib Mehri and Maxine Eskenazi. 2020. Unsupervised evaluation of interactive dialog with DialoGPT. InProceedingsofthe21thAnnualMeetingoftheSpecialInterestGrouponDiscourse andDialogue, pages 225–235, 1st virtual meeting. Association for Computational Linguistics. URL: https://aclanthology.org/2020.sigdial-1.28. Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In InternationalConferenceonLearningRepresentations. URL:https: //openreview.net/forum?id=ByJHuTgA-. Tao Meng, Sidi Lu, Nanyun Peng, and Kai-Wei Chang. 2022. Controllable text generation with neurally-decomposed oracle. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022). Mohsen Mesgar, Sebastian Bücker, and Iryna Gurevych. 2020. Dialogue coherence assessment without explicit dialogue act labels. In Proceedings of the 58th Annual Meeting of the Association for ComputationalLinguistics,ACL 2020,Online,July5-10,2020, pages 1439–1450. Association for Computational Linguistics. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st ICLR 2013, Workshop Track Proceedings. Eleni Miltsakaki and Karen Kukich. 2004. Evaluation of text coherence for electronic essay scoring systems. Nat. Lang. Eng., 10(1):25–55. Ahmadreza Mosallanezhad, Kai Shu, and Huan Liu. 2021. Generating topic-preserving synthetic news. In 2021 IEEE International Conference on Big Data (Big Data), pages 490–499. Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in NaturalLanguage Processing,EMNLP2017, Copenhagen,Denmark,September9-11, 2017, pages 2241–2252. URL:https://aclanthology.info/papers/D17-1238/d17-1238. 80 OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances inNeural Information ProcessingSystems, 35:27730–27744. Bo Pang, Erik Nijkamp, Wenjuan Han, Linqi Zhou, Yixian Liu, and Kewei Tu. 2020. Towards holistic and automatic evaluation of open-domain dialogue generation. InProceedingsofthe58thAnnual Meeting of theAssociation forComputational Linguistics, pages 3619–3629, Online. Association for Computational Linguistics. URL:https://aclanthology.org/2020.acl-main.333. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002a. Bleu: a method for automatic evaluation of machine translation. In Proceedingsofthe40thAnnualMeetingofthe Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., pages 311–318. ACL. URL:http://www.aclweb.org/anthology/P02-1040.pdf. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002b. Bleu: a method for automatic evaluation of machine translation. In Proceedingsofthe40thAnnualMeetingofthe Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. URL:https://aclanthology.org/P02-1040. Matthew E. 
Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings ofthe 2018 ConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguistics: Human LanguageTechnologies, NAACL-HLT2018, NewOrleans, Louisiana, USA,June1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. URL: https://aclanthology.info/papers/N18-1202/ n18-1202. Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics. URL:https://aclanthology.org/ D19-1250. James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, et al. 2003. The timebank corpus. In Corpus linguistics, volume 2003, page 40. Lancaster, UK. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL: https://s3-us-west-2.amazonaws.com/ openai-assets/research-covers/language-unsupervised/language_understanding_ paper.pdf. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9. 81 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020a. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020b. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67. URL: http://jmlr.org/papers/v21/20-074.html. Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463–473, Melbourne, Australia. Association for Computational Linguistics. URL:https://aclanthology. org/P18-1043. Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3982–3992. URL:https://aclanthology.org/D19-1410.pdf. Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2020. Investigating pretrained language models for graph-to-text generation. CoRR, abs/2007.08426. Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. InProceedingsoftheAAAIconferenceonartificialintelligence , volume 33, pages 3027–3035. Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, and Dan Roth. 2020. Commonsense reasoning for natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 27–33, Online. 
Association for Computational Linguistics. URL:https://aclanthology.org/2020.acl-tutorials.7. Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019a. What makes a good conversa- tion? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the NAACL NAACL-HLT. Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019b. What makes a good conversa- tion? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference oftheNorthAmericanChapteroftheAssociationforComputationalLinguistics: HumanLanguage Technologies, Volume 1 (Long and Short Papers), pages 1702–1723, Minneapolis, Minnesota. Association for Computational Linguistics. URL:https://aclanthology.org/N19-1170. Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. InProceedingsofthe53rdAnnualMeetingoftheAssociationforComputational Linguisticsandthe7thInternationalJointConferenceonNaturalLanguageProcessingoftheAsian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1577–1586. URL:http://aclweb.org/anthology/P/P15/P15-1152.pdf. 82 Yuanlong Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, and Ray Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,EMNLP2017,Copenhagen,Denmark,September9-11,2017, pages 2210–2219. URL: https://aclanthology.info/papers/D17-1235/d17-1235. Linfeng Song, Daniel Gildea, Yue Zhang, Zhiguo Wang, and Jinsong Su. 2019. Semantic neural machine translation using AMR. Trans. Assoc. Comput. Linguistics, 7:19–31. Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context- sensitive generation of conversational responses. InNAACLHLT2015,The2015Conferenceof the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 196–205. The Association for Computational Linguistics. URL:http://aclweb.org/anthology/N/N15/N15-1020.pdf. Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Thirty-first AAAI conference on artificial intelligence . Robyn Speer and Catherine Havasi. 2012. Representing general relational knowledge in conceptnet 5. InProceedingsoftheEighthInternationalConferenceonLanguageResourcesandEvaluation, LREC2012, Istanbul, Turkey,May 23-25, 2012, pages 3679–3686. European Language Resources Association (ELRA). Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. In35stConferenceonnNeuralInformationProcessingSystems(NeurIPS2020, volume 33, pages 3008–3021. Hui Su, Xiaoyu Shen, Pengwei Hu, Wenjie Li, and Yun Chen. 2018. Dialogue generation with GAN. In (McIlraith and Weinberger, 2018), pages 8163–8164. URL:https://www.aaai.org/ ocs/index.php/AAAI/AAAI18/paper/view/16508. Leonard Talmy. 1988. Force dynamics in language and cognition. Cognitive science, 12(1):49–100. Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018a. RUBER: an unsupervised method for automatic evaluation of open-domain dialog systems. 
Abstract
With the rapid development of open-domain dialogue systems in recent years, it is imperative to have precise evaluation metrics that correctly assess the quality of these systems. To this end, many researchers resort primarily to human evaluation, which is time-consuming and expensive and does not facilitate model comparisons across research papers. Accurate automatic evaluation metrics are therefore needed to accelerate the development cycle by assisting architecture search and hyperparameter tuning. Reference-based metrics such as BLEU or ROUGE fail to correlate well with human judgment in open-domain settings, since there can be many plausible responses that do not overlap significantly with the limited set of given references. This shortcoming has steered research toward learning-based evaluation metrics that are more sophisticated and reliable.
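To make this failure concrete, here is a minimal sketch (an illustration of the claim, not code from the dissertation; the example responses and the use of NLTK's sentence-level BLEU are assumptions introduced here) showing how a perfectly plausible reply with little lexical overlap scores far below a near-copy of the single reference:

```python
# Illustration only: a plausible open-domain reply can share almost no
# n-grams with the one given reference, so an overlap-based metric such as
# BLEU scores it near zero, while a response that merely copies the
# reference scores highly.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i love hiking in the mountains on weekends".split()
plausible = "my favorite weekend plan is climbing a few nearby trails".split()
copycat = "i love hiking in the mountains on weekends too".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], plausible, smoothing_function=smooth))  # close to 0
print(sentence_bleu([reference], copycat, smoothing_function=smooth))    # much higher
```

Both replies are acceptable continuations of the same conversation, yet the overlap-based score separates them sharply, which is exactly the mismatch with human judgment described above.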
Automatic evaluation of open-domain dialogue systems is multifaceted, with many fine-grained quality aspects. This dissertation explores both turn-level and conversation-level facets of open-domain dialogue evaluation. We train models that automatically assess the relevance, engagement, coherence, and commonsense aspects of responses generated by dialogue models, formulating evaluation as a classification task over response quality. To this end, we focus on the training data and the model architecture of these metrics, the two main components on which their quality strongly depends. We start with heuristic text-level manipulations, such as random swapping of utterances, to create negative samples for training evaluation metrics. We then show that such manipulations are insufficient to reflect the issues that arise in interactions between advanced dialogue models and humans. To address this, we propose semantic-level perturbations of human-written responses that yield challenging negative responses more likely to be produced by state-of-the-art dialogue models. Next, we complete our investigation of dialogue evaluation by concentrating on the model architecture of these metrics, incorporating knowledge from knowledge bases and leveraging prompt-based generative models in a low-resource setting. Finally, beyond assessment, the primary goal of automatic evaluation metrics, we leverage them as control factors that guide dialogue models toward higher-quality responses.
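As a hypothetical illustration of the heuristic negative sampling described above (the data, the function name build_training_pairs, and the labels are invented here and are not the dissertation's implementation), each dialogue context keeps its genuine response as a positive example, while a response randomly swapped in from another dialogue serves as an irrelevant negative example for training a classification-based metric:

```python
# Hypothetical sketch of text-level negative sampling for a classification-
# based relevance metric: positive pairs keep the original (context, response);
# negative pairs swap in a response drawn from a different dialogue.
import random

dialogues = [
    {"context": "do you like jazz?", "response": "yes, miles davis is my favorite."},
    {"context": "any plans for the weekend?", "response": "probably a hike if it stays sunny."},
    {"context": "what are you cooking tonight?", "response": "a simple pasta with garlic and basil."},
]

def build_training_pairs(dialogues, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i, d in enumerate(dialogues):
        # Positive sample: the genuine response, labeled relevant (1).
        pairs.append((d["context"], d["response"], 1))
        # Negative sample: a response randomly taken from another dialogue,
        # labeled irrelevant (0).
        j = rng.choice([k for k in range(len(dialogues)) if k != i])
        pairs.append((d["context"], dialogues[j]["response"], 0))
    return pairs

for context, response, label in build_training_pairs(dialogues):
    print(label, "|", context, "->", response)
```

A relevance classifier trained on such pairs can then score unseen context-response pairs; as the abstract notes, the dissertation later replaces these easy negatives with harder semantic-level perturbations of human-written responses.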
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Incrementality for visual reference resolution in spoken dialogue systems
Rapid prototyping and evaluation of dialogue systems for virtual humans
Responsible artificial intelligence for a complex world
Emphasizing the importance of data and evaluation in the era of large language models
Identifying Social Roles in Online Contentious Discussions
An investigation of fully interactive multi-role dialogue agents
Computational modeling of mental health therapy sessions
Event-centric reasoning with neuro-symbolic networks and knowledge incorporation
Adapting pre-trained representation towards downstream tasks
Dialogue management in spoken dialogue systems with degrees of grounding
Building generalizable language models for code processing
Kernel methods for unsupervised domain adaptation
Active sensing in robotic deployments
Advancing robot autonomy for long-horizon tasks
Fairness in natural language generation
Understanding and generating multimodal feedback in human-machine story-telling
Unsupervised domain adaptation with private data
Generating psycholinguistic norms and applications
Plant substructuring and real-time simulation using model reduction
Neural creative language generation
Asset Metadata
Creator
Ghazarian, Sarik
(author)
Core Title
Automatic evaluation of open-domain dialogue systems
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2023-12
Publication Date
09/12/2023
Defense Date
08/23/2023
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
automatic evaluation, dialogue systems, OAI-PMH Harvest, open-domain dialogues, reference-free metrics
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Galstyan, Aram (committee chair), Dehghani, Morteza (committee member), Georgila, Kallirroi (committee member), Peng, Nanyun (committee member), Sukhatme, Gaurav (committee member)
Creator Email
sarikgha@usc.edu, sarikghazarian@yahoo.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113340410
Unique identifier
UC113340410
Identifier
etd-GhazarianS-12377.pdf (filename)
Legacy Identifier
etd-GhazarianS-12377.pdf
Document Type
Thesis
Rights
Ghazarian, Sarik
Internet Media Type
application/pdf
Type
texts
Source
20230913-usctheses-batch-1097 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu