SYNTAX-AWARE NATURAL LANGUAGE PROCESSING TECHNIQUES AND THEIR APPLICATIONS

by

Chengwei Wei

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2024

Copyright 2024 Chengwei Wei

Acknowledgements

First and foremost, I am deeply grateful to my doctoral advisor, Prof. C.-C. Jay Kuo, for granting me the opportunity to collaborate with him and for his unwavering guidance and support throughout my PhD journey. I came to know Prof. Kuo during his digital image processing class. His spirit and passion for research have been truly inspiring, leading me to join the lab in the summer of 2019, and driving me to pursue excellence in my academic journey. I deeply admire his persistence and efficiency in research endeavors. Additionally, I have greatly benefited from his writing and presentation skills. He always helps me revise my papers thoroughly and promptly, and provides me with valuable feedback after every seminar presentation. Moreover, I am grateful for his visionary insights into future research directions, exemplified by our collaborative effort on a survey paper on language models in the summer of 2022. This collaboration proved prescient, as LLMs emerged as a notable trend in the NLP field following the release of ChatGPT. The experience gained from working on this survey paper has equipped me with indispensable skills and knowledge that have greatly benefited my academic pursuits and job searches.

I would also like to extend my gratitude to Prof. Antonio Ortega and Prof. Swabha Swayamdipta for their roles on both my qualifying and defense committees, and to Prof. Keith Jenkins and Prof. Keith Chugg for their service on my qualifying exam committee. Their insightful comments and feedback have been invaluable, encouraging me to approach problems from a broader perspective. My thesis would not have been completed without their valuable assistance.

Furthermore, I wish to extend my gratitude to my collaborators. Special appreciation goes to Bin Wang for his guidance and for steering me towards the field of Natural Language Processing. He has served as a secondary advisor in numerous projects we have undertaken together, including word embedding learning, sentence similarity evaluation, and language model surveys. I am also thankful for Wei Wang's mentoring when I initially joined the lab. Yun-Cheng Wang has been indispensable in the language model survey project, offering insights into integrating knowledge graphs and language models. Hong-Shou Chen has provided me with invaluable suggestions and shared their expertise on various machine learning problems. Additionally, I value the insightful discussions with Tsung-Shan Yang and Xuejing Tang, especially when addressing text and vision problems. I am grateful to USC for providing a conducive research, study, and teaching environment.

Last but certainly not least, the unconditional support of my family has been invaluable. They have provided me with a solid foundation for my overseas study. Despite not being physically present during my overseas studies for over 5 years, they have always been ready to listen to my problems with compassion and offer assistance in overcoming challenges, both in life and academia.
Throughout this doctoral journey, my family has been an unwavering pillar of strength, providing constant encouragement and understanding, which has been pivotal in sustaining my passion and perseverance.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Contributions of the Research
    1.2.1 Green Syntactic Structure Construction
    1.2.2 Syntax-Aware Word Embedding
    1.2.3 Word Mover's Distance Computation and Its Application
    1.2.4 Unsupervised Compressive Summarization
    1.2.5 Sub-structure Beam Search for Structured Data Generation
  1.3 Organization of the Thesis
Chapter 2: Research Background
  2.1 Syntactic Structure Construction in Sentence
    2.1.1 POS tagging
    2.1.2 Syntactic parsing
  2.2 Word Embedding
    2.2.1 Static Word Embedding
    2.2.2 Syntactic Parsing in Word Embedding
  2.3 Sentence Similarity Measures
    2.3.1 Sentence Embedding
    2.3.2 Word Alignment
  2.4 Text Summarization
    2.4.1 Unsupervised Extractive Summarization
    2.4.2 Compressive Summarization
  2.5 Language Models
    2.5.1 Types of Language Models
    2.5.2 Linguistic Units
    2.5.3 Architecture of Language Models
    2.5.4 Pre-trained Language Models
    2.5.5 Decoding Methods
    2.5.6 Model Evaluation
    2.5.7 Efficient Models
Chapter 3: Green Syntactic Structure Construction
  3.1 Introduction
  3.2 Proposed GWPT Method
    3.2.1 Frequency Analysis of Embedding Dimensions
    3.2.2 Concise Representation with Adaptive N-grams
    3.2.3 Discriminant Feature Selection
    3.2.4 Classification for POS Labels
  3.3 Experiments
    3.3.1 Datasets and Experimental Setup
    3.3.2 Comparison with MultiBPEmb
    3.3.3 Comparison with Other POS Taggers
    3.3.4 Ablation Study
    3.3.5 Effect of Parameters in XGBoost
  3.4 Conclusion and Future Work
Chapter 4: Syntax-Aware Word Embedding
  4.1 Introduction
  4.2 Methodology
    4.2.1 DWE Method
    4.2.2 CEDWE Method
  4.3 Experiments
    4.3.1 Datasets, Experiment Setup, and Benchmarks
    4.3.2 Results and Analysis
  4.4 Conclusion and Future Work
Chapter 5: Word Mover's Distance Computation and Its Application
  5.1 Introduction
  5.2 Methodology
    5.2.1 Word Mover's Distance
    5.2.2 Syntax-aware Word Flow
    5.2.3 Syntax-aware Word Distance
  5.3 Experiments
    5.3.1 Semantic Textual Similarity
    5.3.2 Further Analysis on STS
    5.3.3 Sentence Classification & Re-Ranking
    5.3.4 Visualization of SynWMD
  5.4 Conclusion and Future Work
Chapter 6: Unsupervised Compressive Summarization
  6.1 Introduction
  6.2 Compressibility Study
    6.2.1 Oracle Design
    6.2.2 Datasets and Evaluation Metrics
    6.2.3 Experimental Results and Analysis
  6.3 Methodology
    6.3.1 Sentence extraction
    6.3.2 Phrase extraction
  6.4 Experiments
    6.4.1 Datasets, Experiment Setup, and Benchmarks
    6.4.2 Results and Analysis
  6.5 Conclusion and Future Work
Chapter 7: Sub-Structure Beam Search (SBS) for Structured Data Generation
  7.1 Introduction
  7.2 Method
    7.2.1 Preliminary
    7.2.2 Sub-structure Score Calculation in Structured Data Generation
    7.2.3 Sub-Structure Beam Search Decoding
  7.3 Sub-structure Score Calculation & Experiments
    7.3.1 Token Conditional Probability
    7.3.2 Experiments on Attribute Extraction
    7.3.3 Confidence Estimator
    7.3.4 Experiments on Model Confidence
    7.3.5 Decoding Speed
  7.4 Conclusion and Future Work
Chapter 8: Conclusion and Future Work
  8.1 Summary of the Research
  8.2 Future Research Directions
    8.2.1 Mathematical Language Model
    8.2.2 Interpretable Language Model
Bibliography

List of Tables

2.1 Transformer-based PLMs.
2.2 The number of parameters, training data, cost, and time of several large LMs, where blank cells indicate that the data are not available. The sources are cited if the data are not obtained from the original work.
3.1 Frequency partitioning and N-gram choices.
3.2 POS tagging accuracy on UD's test dataset.
3.3 Comparison of model sizes and inference FLOP numbers of MultiBPEmb and GWPT.
3.4 Comparison of POS tagging accuracy rates for the PTB and UD test datasets, where [†] denotes a method implemented by ourselves.
3.5 POS tagging accuracy using different N-grams for the UD dataset.
3.6 POS tagging accuracy using DFT on the UD test set.
4.1 Categories of universal syntactic relations.
4.2 Test accuracy comparison of several word embedding methods with the Logistic Regression classifier, where the best and the second-best results are displayed in boldface and with underscore, respectively. †: word embeddings are pre-trained on large corpora; ∗: word embeddings are trained on the text classification datasets.
4.3 Test accuracy comparison of several word embedding methods with the XGBoost classifier, where the best and the second-best results are displayed in boldface and with underscore, respectively. †: word embeddings are pre-trained on large corpora; ∗: word embeddings are trained on the text classification datasets.
4.4 Classification accuracy results and the number of word-context sample pairs (in millions) for the dependency-based contexts, where "DWE w/o K" denotes the proposed DWE method without the use of the keyword context.
5.1 Spearman's correlation (ρ × 100) comparison of unsupervised methods, where the best results for each word embedding are displayed in boldface. The numbers in brackets show the performance gain or loss of our methods as compared with WMDcos+IDF. Results of [†] are taken from [70].
5.2 Comparison of Spearman's correlation (ρ × 100) of using the subtree and the n-gram in SWD.
5.3 Comparison of test accuracy for k-nearest neighbor sentence classification. The best results for each dataset are displayed in boldface.
5.4 Experimental results on the AskUbuntu dataset with four rank-based evaluation metrics: 1) Mean Average Precision (MAP), 2) Precision@1 (P@1), 3) Precision@5 (P@5), and 4) Mean Reciprocal Rank (MRR). The best results are displayed in boldface.
6.1 ROUGE scores of Oracles on the CNN/DM, XSUM, and PubMed datasets.
6.2 Fluency measurement of Oracles. A lower PPL and a higher SLOR indicate better fluency.
6.3 Summary average length and compression ratio (in terms of words).
6.4 ROUGE scores of Oracles.
6.5 ROUGE scores on the CNN/DM, XSUM, and PubMed datasets.
6.6 Fluency measurement of reference summaries, PacSum, and ECSTE. A lower PPL and a higher SLOR indicate better fluency.
7.1 F1 scores of decoding methods on attribute extraction. The best results are displayed in boldface.
7.2 Average Precision and R@P90 of decoding methods, where the best and the second-best results are displayed in boldface and with underscore, respectively. The probabilities of generations are calculated by the CP method.

List of Figures

1.1 Syntax in textual data.
1.2 Two different syntactic structures of the sentence "I shot an elephant in my pajamas."
2.1 Two different syntactic parse trees of the sentence "I prefer the morning flight through Denver."
2.2 An example of a dependency parse tree [156].
2.3 The use of different permutations in a natural sentence.
2.4 Illustration of the BPE merge operation conducted on the dictionary {"hug", "pug", "pun", "bun"}. The vocabulary is initialized with all characters. Then, a new subword is created by merging the most frequent pair.
2.5 The structure of FFN LMs, where u_{t−N+1}, ..., u_{t−1} denote the preceding contexts of u_t in a fixed window, and P, H, and O are the dimensions of the projection, the hidden layer, and the output layer, respectively.
2.6 The structure of RNN LMs.
2.7 The structure of a transformer [239].
2.8 Illustration of different transformer models, where BERT is an encoder-only model, GPT is a decoder-only model, and BART is an encoder-decoder model.
2.9 An illustration of (a) LM pre-training, (b) standard fine-tuning, and (c) prompt-based fine-tuning (or prompt-tuning) [69].
2.10 Comparison of texts generated by the GPT-2 large language model (LLM) using beam search (left) and pure sampling decoding (right). Beam search yields degenerate repetition (in blue) while pure sampling results in incoherent gibberish (in red) [85].
2.11 Performance curves as functions of the pre-training dataset size, where classifier probing measures the quality of the syntactic and semantic features, minimum description length probing quantifies the accessibility of these features, the BLiMP curve measures the model's knowledge of various syntactic phenomena, and SuperGLUE measures the capability of handling NLU tasks [293].
2.12 The structure of ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [41].
3.1 The system diagram of the GWPT method.
3.2 The averaged normalized sign-change ratio (NSR) as a function of the sorted embedding dimension index, from the smallest value (l = 1) to the largest value (l = 768), for the Penn Treebank dataset using the BERT word embedding. Dimension indices are partitioned into low-, mid-, and high-frequency sets using two elbow points at l = 50 and l = 751.
3.3 The validation error rate as a function of the number of XGBoost trees per class on the UD datasets: (left) fastText and (right) BERT.
3.4 Sorted discriminability for each feature dimension selected by DFT, and validation and test accuracies on the UD dataset. A lower cross-entropy value indicates a more discriminant feature.
3.5 The effect of the maximum depth and the tree number in XGBoost on GWPT for the UD test set: POS tagging accuracy (top) and model size (bottom).
4.1 The dependency parse tree for an example sentence: "He found a skinny and fragile dog in his backyard."
4.2 Two example sentences and their corresponding dependency parse trees. The keywords spotted by our method are marked in red.
4.3 Classification accuracy curves as a function of embedding dimensions for three datasets: (a) AG_NEWS, (b) DBpedia, and (c) YelpReviewPolarity, where the tested dimensions are set to 50, 100, 200, and 300.
4.4 Classification accuracy as a function of hop sizes for the AG_NEWS dataset, where the results are obtained by using only n-hop neighbor words in the dependency parse tree as contexts (namely, the keyword contexts are ignored), with n = 1, ..., 6.
4.5 Visualization of the embedding spaces of (a) DWE and (b) CEDWE for the AG_NEWS dataset.
4.6 Visualization of the embedding spaces of (a) DWE and (b) CEDWE for the YelpReviewPolarity dataset.
5.1 The structure of a dependency parse tree for an exemplary sentence: "He found a skinny and fragile dog in his backyard."
5.2 Illustration of the shortcoming of the distance calculation in WMD and the improved SWD solution. The distance between words in SWD is decided by word embeddings and subtree embeddings.
5.3 1- and 2-hop subtrees with "open" as the parent node in the dependency parse tree for the sentence "I am not sure if you can open a bank account in France". Stopwords are ignored in the figure.
5.4 The average Spearman's correlation curves on STS datasets as a function of hop sizes or window sizes for three word embeddings: (a) word2vec, (b) BERT, and (c) SimCSE-BERT.
5.5 The averaged pairwise cosine distance of words in a sentence of the STS datasets with three embeddings.
5.6 Visualization of the word flow assigned by SWF, where weights are normalized. The higher the weight, the darker the color.
5.7 Visualization of the word distance between "bank" in sentence S1 and words in sentence S2. Sentence S1: "We camped near the bank of the river." Sentence S2: "I am not sure if you can open a bank account in France." The darker the color, the larger the distance.
6.1 Phrases that can be removed in the sentence "the shot was fired at a dark coloured car by a white man." PP = prepositional phrase, ADJP = adjective phrase, JJ = adjective.
6.2 Exemplary summaries output by our model on the CNN/DailyMail and XSUM datasets. For illustration, the compressive summary shows the removed phrases using red strike-through.
6.3 Exemplary failure cases on the CNN/DailyMail dataset. For illustration, the compressive summary shows the erroneously removed phrases using blue strike-through.
7.1 Examples of structured data generation using LLMs.
7.2 In structured data generation, we tokenize the LLM output into sub-structure sequences and assign a score to each prediction based on the prescribed method [266].
7.3 Sub-structure Beam Search. The beam size is set to 2 for illustration. In the decoding process, beams with higher scores (highlighted in green) are kept.
7.4 An example from the OA-Mine dataset.
7.5 Few-shot prompt template for product attribute extraction.
7.6 Confidence Estimator: it uses the hidden states from the LLM to assess the score of a generated sub-structure.
7.7 F1 score and decoding speed of decoding methods on the AE-110K dataset using Gemma-2B.

Abstract

Syntax in language processing controls the structure of textual data, playing a crucial role in textual data understanding and generation. For example, syntax in natural language sentences governs the relationships between words, which is crucial for grasping a sentence's overall meaning. In programming languages, syntax defines the proper combination of symbols, ensuring computers can interpret and execute code statements accurately. This thesis has two primary objectives: 1) develop efficient methods for constructing syntactic structures, and 2) investigate the significance of syntax and integrate syntax-aware techniques into various Natural Language Processing (NLP) applications.

We first build an efficient part-of-speech (POS) tagger. POS denotes a word's syntactic function in a sentence. This form of syntactic information is crucial for constructing sentence syntactic structures. In the rest of the thesis, we explore syntax-aware techniques in various NLP applications, including word-level, sentence-level, document-level, and structured-data-level tasks.

On the word-level task, we apply syntax-aware techniques to word embedding learning. Word embedding methods learn word representations from context, which is based on the distributional hypothesis. Most previous word embedding methods use sliding windows to select sequential context, and the learned word embeddings are for general purposes.
With contexts selected by dependency parsing and enhancement from word-class mutual information, our proposed classification-specific dependency-based word embedding outperforms several state-of-the-art word embedding methods on text classification tasks.

On sentence-level tasks, we apply syntax-aware techniques to sentence similarity evaluation. Sentence similarity evaluation measures the semantic similarity between sentences, which is important in information retrieval, text summarization, and question answering. In this thesis, we propose a syntax-aware Word Mover's Distance (SynWMD) algorithm to address the limitations of the original WMD. The SynWMD approach improves the performance of sentence similarity evaluation by incorporating the dependency parse tree technique in both word flow assignment and word distance modeling.

On document-level tasks, we apply syntax-aware techniques to text summarization. In this thesis, we conduct a comprehensive study of the impact of further compressing the selected sentences in the summary. The results show that, under syntactic compression rules, further compression of selected sentences can significantly enhance the performance of summarization models. Additionally, we propose an unsupervised compressive method that leverages word and sentence embeddings to select phrases and sentences as the final summary. The experimental results demonstrate that this method improves performance compared to traditional sentence-level extractive text summarization.

Lastly, on the structured-data-level task, we present a novel decoding method called Sub-structure Beam Search (SUBS) for generating structured data. Unlike conventional natural language, structured textual data, such as knowledge entities and tabular data, follows specific syntax formats. By incorporating sub-structure information from the structured data during the text generation decoding process, our decoding method significantly enhances the quality of structured data generated by LLMs.

Chapter 1: Introduction

1.1 Significance of the Research

Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence (AI), and linguistics that deals with the interactions between computers and human languages. It enables computers to understand and generate languages. NLP techniques have become ubiquitous in daily life, with applications such as grammatical error correction, question answering, and text summarization. More recently, the success of Large Language Models (LLMs) [270], such as OpenAI's GPT series (https://openai.com/blog/chatgpt/), has notably advanced language understanding and generation, drawing considerable attention to the field of NLP.

Figure 1.1: Syntax in textual data

Syntax in NLP defines the structure of textual data, making it essential for understanding and generating textual data. As depicted in Fig. 1.1, a syntactic parse tree analyzes the grammatical structure of natural language sentences, helping humans understand the relationships between words and comprehend the sentence's overall meaning. Syntax extends beyond natural languages and holds significance in other types of textual data. For example, in programming languages, syntax controls the correct combination of symbols to form well-structured code statements so that they can be correctly understood and executed by computers. A knowledge entity stored in a JSON-like format adheres to specific syntax rules, ensuring accurate saving and retrieval processes.
Similarly, tabular data arranges entries according to certain syntax rules, namely the relationships between data entries represented by columns, rows, and tables.

Fig. 1.2 shows another example that demonstrates the importance of syntax in language understanding. Without knowing the syntactic structure of the sentence "I shot an elephant in my pajamas.", there exists ambiguity. Since the phrase "in my pajamas" can be interpreted as modifying either "I" or "elephant", there are two possible interpretations of the sentence:

1. "I shot an elephant that was wearing my pajamas." In this interpretation, "in my pajamas" modifies "an elephant", indicating that the elephant was wearing the pajamas when it was shot.

2. "I, while wearing my pajamas, shot an elephant." In this interpretation, "in my pajamas" modifies "I", indicating that the speaker was wearing pajamas when he shot the elephant.

The above examples show that syntax plays a crucial role in understanding and generating textual data. Given the crucial role of syntax, many NLP applications have considered it a fundamental component in model design. In machine translation, syntax can help determine the appropriate word order and sentence structure in the target language. By accurately identifying the syntactic structure of a sentence, machine translation models ensure grammatically correct translations that convey the intended meaning.

Figure 1.2: Two different syntactic structures of the sentence "I shot an elephant in my pajamas."

In text classification, syntax is leveraged to discern the presence or absence of specific syntactic structures, such as noun phrases, verb phrases, or adjective phrases, which serve as crucial features for text classification. Furthermore, in prompt engineering for LLMs, ensuring that the input adheres to an appropriate format, namely syntax, helps the model understand the user's intent and generate relevant and coherent responses.

In this thesis, our first focus is on part-of-speech (POS) tagging, a critical task for establishing words' syntactic functions in a sentence. Our goal is to develop a lightweight and computationally efficient POS tagger. Subsequently, we employ syntax-aware techniques in various NLP applications, spanning word-level, sentence-level, document-level, and structured-data-level tasks. Specifically, at the word level, our attention is on word embedding learning, which represents words as dense vectors in a high-dimensional space to capture semantic and syntactic relationships between them. This is a crucial component for numerous NLP tasks. Moving to the sentence level, our focus shifts to sentence similarity evaluation, which measures the similarity or relatedness between sentences. It is important in information retrieval, text summarization, question answering, and machine translation. At the document level, we explore text summarization, which generates a concise and coherent summary of lengthy text, leading to efficiency in information retrieval and document processing. Lastly, our research explores structured data generation using LLMs. Structured data, such as knowledge entities and product catalogs, adhere to predefined formats, i.e., syntax. With the evolving capabilities of LLMs in data generation, our emphasis is on enhancing their proficiency in generating structured data, extending beyond conventional natural language generation.

1.2 Contributions of the Research

1.2.1 Green Syntactic Structure Construction

POS denotes a word's syntactic function within a sentence.
In this study, our goal is to develop lightweight and computationally efficient methods for predicting word syntactic functions. We introduce a novel word-embedding-based POS tagger named GWPT. The main contributions of this work are summarized below.

• GWPT, a new efficient representation method for POS tagging derived from word embeddings, is proposed. It discards low-frequency dimension indices and adopts N-gram representations for those in the mid- and high-frequency sets to enhance the overall effectiveness of the proposed method.

• Extensive POS tagging experiments are conducted to evaluate the tagging accuracy, model sizes, and computational complexity of several benchmarking methods. GWPT offers competitive tagging accuracy with smaller model sizes and significantly reduced complexity.

1.2.2 Syntax-Aware Word Embedding

In this work, two task-specific dependency-based word embedding methods are proposed for text classification. Our methods follow the PPMI matrix factorization framework and derive word contexts based on the dependency parse tree. The first one, called dependency-based word embedding (DWE), chooses keywords and neighbor words of a target word in the dependency parse tree as contexts to build the word-context matrix. The second method, named class-enhanced dependency-based word embedding (CEDWE), learns from word-context as well as word-class co-occurrence statistics. DWE and CEDWE are evaluated on popular text classification datasets to demonstrate their effectiveness. Experimental results show that they outperform several state-of-the-art word embedding methods. The three main contributions of this work are summarized below.

• We exploit the dependency relations in the dependency parse tree to construct more effective contexts consisting of both keywords and neighbor words.

• We propose a mechanism to merge word-context and word-class mutual information into a single matrix for factorization so as to enhance text classification accuracy.

• We conduct extensive experiments on large-scale text classification datasets with the logistic regression and XGBoost classifiers to evaluate the effectiveness of the proposed DWE and CEDWE methods.

1.2.3 Word Mover's Distance Computation and Its Application

In this work, we propose SynWMD for sentence similarity evaluation. SynWMD incorporates the dependency parse tree technique in both word flow assignment and word distance modeling to improve the performance of sentence similarity evaluation. This work has the following three main contributions:

• A new syntax-aware word flow calculation method is proposed. Words are first represented as a weighted graph based on the co-occurrence statistics obtained from dependency parse trees. Then, a PageRank-based algorithm is used to infer word importance.

• The word distance model in WMD is enhanced by the context extracted from dependency parse trees. The contextual information of words and the structural information of sentences are explicitly modeled as additional subtree embeddings.

• We conduct extensive experiments on semantic textual similarity tasks and k-nearest neighbor sentence classification tasks to evaluate the effectiveness of the proposed SynWMD. The code for SynWMD is available at https://github.com/amao0o0/SynWMD.

1.2.4 Unsupervised Compressive Summarization

This work aims to investigate the importance of compressive summarization and develop an unsupervised approach to this technique.
We first use Oracle algorithms to study the effects of further compressing the selected sentences on the ROUGE score, fluency, and compression ratio. The results show that further compression of selected sentences can significantly enhance the performance of extractive summarization models. However, to ensure the fluency of the resulting summaries, it is necessary to consider syntactic rules. Then, by studying the relationships between sentences and words within the document using sentence and word embeddings, we design ranking functions that can identify and remove irrelevant sub-sentential units, resulting in more concise summaries. Our experimental results demonstrate that our methods can significantly improve traditional sentence-level extractive methods by simply adding our phrase extraction stage. Our contributions can be summarized as follows:

• We comprehensively examine the impact of further compressing extracted sentences on recently developed summarization datasets. Our experimental results demonstrate the significant performance gains of compressive summarization and highlight the importance of syntactic guidance in this approach.

• We propose an effective unsupervised sub-sentential extractive summarization method using word and sentence embeddings. Despite its simplicity, our method outperforms state-of-the-art unsupervised extractive summarization techniques, as demonstrated by our experimental results.

1.2.5 Sub-structure Beam Search for Structured Data Generation

In this work, we introduce a novel text generation decoding method called Sub-structure Beam Search (SUBS). Unlike traditional decoding methods such as greedy search and token-level beam search, which solely consider token-level conditional probability, SUBS operates at the sub-structure level and incorporates the score of each generated sub-structure in structured data generation with LLMs. Experimental results on information extraction demonstrate that SUBS notably enhances the quality of LLM structured data generation without requiring additional training of the LLMs. Our contributions can be summarized as follows:

• We propose SUBS decoding, which incorporates the scores of sub-structures when generating structured data.

• We explore two sub-structure scoring methods: one aggregates the conditional probabilities of tokens in the sub-structure, while the other trains a small external model called the Confidence Estimator.

• Experimental results on product attribute extraction across multiple LLMs demonstrate significant improvements in generation quality, as measured by F1 score and average precision, achieved by SUBS.

1.3 Organization of the Thesis

The rest of the thesis is structured as follows. Chapter 2 provides an overview of the necessary background knowledge for this thesis, including syntactic structure construction for words and sentences. We also introduce existing work on word embedding methods, sentence similarity evaluation methods, and text summarization. Finally, a detailed introduction to language models is provided. Chapter 3 presents our efficient model for POS tagging. In Chapter 4, we present a classification-specific word embedding method that incorporates syntactic context and word-class mutual information to improve its effectiveness. Chapter 5 proposes a novel method for evaluating sentence similarity, called Syntax-aware Word Mover's Distance (SynWMD), that leverages syntactic parsing to enhance performance.
In Chapter 6, we study the importance of further compression of extracted sentences in text summarization and propose an effective unsupervised compressive summarization method. Chapter 7 proposes a novel decoding method, called sub-structure beam search (SUBS), for structured data generation with LLMs. Finally, in Chapter 8, we provide concluding remarks and describe future research directions.

Chapter 2: Research Background

2.1 Syntactic Structure Construction in Sentence

To fully understand the meaning of a sentence, it is essential not only to understand the meaning of individual words but also to discern the relationships between them. Achieving this understanding involves two fundamental tasks: Part-of-Speech (POS) tagging and syntactic parsing. Part-of-speech tagging assigns grammatical attributes, or "tags", such as noun, verb, adjective, etc., to each word in a sentence, thereby explaining the syntactic role of each word within the sentence's structure. In addition, syntactic parsing aims to analyze the hierarchical relationships between words in a sentence, depicting the syntactic structure through parse trees. Together, these tasks are important for sentence analysis, enabling humans and machines to comprehend and generate language.

2.1.1 POS tagging

Part-of-speech (POS) tagging is one of the classical sequence labeling tasks. It aims to tag every word of a sentence with its POS attribute. As POS offers a syntactic attribute of words, POS tagging is useful for many downstream tasks, such as speech recognition, syntactic parsing, and machine translation. POS tagging methods can be categorized into three approaches, rule-based, statistical-based, and DL-based, as elaborated below.

Rule-based approach. Rule-based POS tagging methods [25, 37, 61] utilize pre-defined linguistic rules to assign POS tags to words in sentences. Generally, a rule-based POS tagger initially assigns each word its most likely POS using a dictionary derived from a large tagged corpus, without considering its context. Then, it applies rules to narrow down and determine the final POS for each word. These rules are created by linguistic experts or derived from corpora based on linguistic features of a language, such as lexical, morphological, and syntactical patterns; an example is switching the POS tag from VBN to VBD when the preceding word is capitalized [25]. While rule-based methods offer simplicity and interpretability, their performance is inadequate in the face of complex and ambiguous instances of a language.

Statistical-based approach. Statistical-based POS tagging methods, also called stochastic tagging, utilize annotated training corpora to learn the statistical relationship between words and their associated POS tags. Specifically, they disambiguate words by considering the probability of a word occurring with a specific tag in a given context. Statistical-based POS tagging often adopts the hidden Markov model (HMM) [105, 120, 237], where POS tags are the hidden states and the words of a sentence serve as observations. HMM-based POS taggers aim to learn the transition probability (i.e., the probability of one POS tag succeeding another) and the emission probability (i.e., the probability of a word being emitted from a specific POS tag) from annotated training corpora. Besides HMM, other statistical models have also been considered, such as the maximum entropy model [188, 295] and conditional random fields (CRF) [4, 180, 219].
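To make the HMM formulation concrete, the minimal sketch below decodes the most likely tag sequence with the Viterbi algorithm from hand-specified transition and emission tables. The toy tag set, vocabulary, and probability values are illustrative assumptions, not quantities taken from this thesis or from the cited taggers.

```python
# Minimal Viterbi decoding sketch for an HMM POS tagger.
# The tag set, vocabulary, and probability tables are toy assumptions for illustration.
import numpy as np

tags = ["DET", "NOUN", "VERB"]
vocab = {"the": 0, "dog": 1, "barks": 2}

start = np.array([0.6, 0.3, 0.1])          # P(tag at position 0)
trans = np.array([[0.1, 0.8, 0.1],         # trans[i, j] = P(tag_j | tag_i)
                  [0.1, 0.2, 0.7],
                  [0.4, 0.4, 0.2]])
emit = np.array([[0.9, 0.05, 0.05],        # emit[i, w] = P(word_w | tag_i)
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])

def viterbi(words):
    obs = [vocab[w] for w in words]
    T, N = len(obs), len(tags)
    score = np.zeros((T, N))               # best log-probability ending in tag j at step t
    back = np.zeros((T, N), dtype=int)     # back-pointer to the best previous tag
    score[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for j in range(N):
            cand = score[t - 1] + np.log(trans[:, j]) + np.log(emit[j, obs[t]])
            back[t, j] = np.argmax(cand)
            score[t, j] = cand[back[t, j]]
    path = [int(np.argmax(score[-1]))]     # trace back the highest-scoring tag path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(["the", "dog", "barks"]))    # expected: ['DET', 'NOUN', 'VERB']
```

In practice, the transition and emission tables are estimated from an annotated corpus rather than specified by hand, but the decoding step is the same.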
DL-based approach. DL-based POS tagging methods have gained popularity owing to their ability to capture linguistic patterns from large amounts of training data and achieve high performance. Common models include recurrent neural networks (RNN) [21, 255, 256] and transformers [129]. The performance of DL-based taggers can be enhanced by integration with other techniques such as character embeddings, adversarial training, or rule-based pre-processing. DL-based POS taggers outperform rule-based and statistical-based methods at substantially higher computational and storage costs. Recently, large language models (LLMs) [29, 171] have been shown to manage POS tagging implicitly and address downstream NLP tasks directly.

2.1.2 Syntactic parsing

Syntactic parsing is a natural language processing technique used to analyze the grammatical structure of a sentence. There are two main types of syntactic parsing: dependency parsing and constituency parsing. Fig. 2.1 shows the parse trees corresponding to dependency parsing and constituency parsing, respectively.

Dependency parsing identifies the dependency relationships between the words in a sentence and creates a directed graph representing these relationships. In dependency parsing, each word in the sentence is represented as a node in the graph, and the dependency relationships between the words are represented as edges. The edges are labeled with the type of dependency relationship between the words, such as subject, object, or modifier. The resulting graph is called a dependency tree or a dependency graph.

Constituency parsing is the process of analyzing a sentence to identify its syntactic structure and hierarchical organization based on the grammatical rules of a language. In constituency parsing, a sentence is divided into a hierarchy of phrases, each of which has a specific grammatical structure and serves a particular function within the sentence. These phrases are called constituents, and they can include nouns, verbs, adjectives, prepositions, and other parts of speech.

Figure 2.1: Two different syntactic parse trees of the sentence "I prefer the morning flight through Denver": (a) dependency parse tree and (b) constituency parse tree.

Syntactic parsing is useful in various NLP applications. In word embedding learning, syntactic parsing can identify the most relevant context for a target word. In sentiment analysis, it can be used to identify the relationships between words in a sentence and find the important words, such as adjectives and adverbs. In machine translation, it is used to convert a source language sentence into a tree structure, which is then used to generate a target language sentence with the same meaning. In question answering, it can be used to determine the syntactic structure of a question and match it with the syntactic structure of a sentence to identify the answer. In text summarization, it can be used to identify the important parts of a sentence and summarize it into a shorter one.

Recent developments in syntactic parsing have focused on the use of deep learning methods, such as convolutional neural networks, recurrent neural networks, and transformers, to learn syntactic structures from large corpora of text. For constituency parsing, the two main approaches are chart-based and transition-based models. Chart-based models [108, 220] work by assigning scores over phrases and forming the constituency parse tree by maximizing the tree score. Transition-based models [139, 265] define a sequence of transition actions to build the tree incrementally. For dependency parsing, most models can be divided into two types, graph-based and transition-based. Graph-based methods [59, 135] assign scores to the edges for word pairs and form the parse tree using a maximum spanning tree algorithm. Transition-based models [33, 145] treat the parsing process as a sequence of state transitions, where each state corresponds to a partially parsed sentence, and the transitions correspond to adding new words and dependency arcs to the parse.
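As a concrete illustration of the dependency structures discussed above, the short sketch below prints (word, dependency label, head) triples for the example sentence of Fig. 2.1. It assumes the spaCy library and its small English model (en_core_web_sm) are installed; the exact labels produced depend on that off-the-shelf parser rather than on anything specific to this thesis.

```python
# Minimal dependency-parsing sketch using spaCy.
# Assumes: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I prefer the morning flight through Denver")

# Each token is a node; its head and dependency label define one labeled edge
# of the dependency tree (e.g., nsubj = nominal subject, dobj = direct object).
for token in doc:
    print(f"{token.text:10s} --{token.dep_:>8s}--> {token.head.text}")

# The same analysis also exposes part-of-speech tags (cf. Sec. 2.1.1).
print([(token.text, token.pos_) for token in doc])
```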
2.2 Word Embedding

Static word embeddings and contextual word embeddings are two different approaches to representing words in natural language processing. Static word embeddings represent words using fixed vectors. In contrast, contextual word embeddings take into account the context in which a word appears, so the contextual embedding of a word changes with its context. Most contextual word embeddings are obtained from pre-trained language models, which we introduce in Sec. 2.5. In this subsection, we focus on static word embeddings.

2.2.1 Static Word Embedding

Most static word embedding methods learn word representations based on the distributional hypothesis [66, 79]; that is, words with similar contexts are expected to have similar meanings. It is therefore natural to take context information into account in word embedding learning. Several word embedding methods were developed by following this idea. They can be categorized into count-based and prediction-based types.

Matrix factorization methods are count-based word embeddings. They represent word contexts using global corpus statistics [124]. They first construct a word-context co-occurrence matrix, where each row represents a word in the vocabulary, each column represents a context word, and each entry represents the co-occurrence statistic between the two words. The SVD algorithm then decomposes this matrix to reduce its dimension so that lower-dimensional vector representations can be obtained.

Word2vec [151] is a prediction-based word embedding method and the most widely used neural model for static word embedding. The CBOW (continuous bag-of-words) and skip-gram models are its two architectures. CBOW learns word embeddings by predicting the target word given its context words. To achieve this, the model architecture consists of an input layer, a hidden layer, and an output layer. The input layer is a one-hot encoded representation of the context words, which is projected onto the hidden layer using a weight matrix. The hidden layer is then averaged, and the resulting vector is passed through a second weight matrix and a softmax layer to generate a probability distribution over the entire vocabulary. The target word is then predicted as the word with the highest probability. The skip-gram model has a similar structure to CBOW, but it learns word embeddings by predicting the context words based on the target word.

GloVe (Global Vectors for Word Representation) [175] combines the strategies of matrix factorization methods and neural models. It uses gradient descent to reconstruct the global word-context co-occurrence matrix to learn word embeddings.

Although these models appear different, they share some similarities. It was theoretically proved in [124] that the learning process of "skip-gram with negative sampling (SGNS)" actually factorizes a shifted PPMI matrix implicitly.
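The count-based pipeline described above (co-occurrence counts, a PPMI transform, and an SVD truncation) can be sketched in a few lines. The toy corpus, window size, and embedding dimension below are illustrative assumptions rather than settings used in this thesis.

```python
# Minimal count-based embedding sketch: co-occurrence counts -> PPMI -> truncated SVD.
# The toy corpus, window size, and dimension are illustrative assumptions.
import numpy as np

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "runs"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric sliding-window co-occurrence counts (window size 2).
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                counts[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information: max(log P(w, c) / (P(w) P(c)), 0).
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.maximum(pmi, 0.0)

# Truncated SVD keeps the top-d singular directions as d-dimensional word vectors.
d = 2
U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :d] * S[:d]
print({w: embeddings[idx[w]].round(3) for w in vocab})
```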
Further study in [125] offered a connection between the PPMI (Positive Pointwise Mutual Information), skip-gram, and GloVe models. Experimental results on several intrinsic tasks indicate that none of these models is significantly better than the others.

2.2.2 Syntactic Parsing in Word Embedding

Syntactic information can be exploited in context construction to learn better word embeddings. For example, research in [173] takes syntactic relations into account in constructing the word-context co-occurrence matrix. Syntactic information was introduced to the skip-gram model in [123]. Furthermore, word embeddings can be learned by predicting dependency-based contexts. Second-order dependency contexts were proposed in [112]. In [128], weights were assigned to different dependencies in the stochastic gradient descent process so that selected contexts are not treated equally; more important contexts get higher weights.

2.3 Sentence Similarity Measures

Recent studies on sentence similarity evaluation can be classified into two main categories: sentence-embedding-based and word-alignment-based methods. They are reviewed below.

2.3.1 Sentence Embedding

One way to assess sentence similarity is through sentence embedding. That is, a sentence is first encoded into a vector with an encoder. The similarity of two sentences is then inferred from the distance between their embedded vectors, where a simple distance metric such as the cosine or the l2 distance can be used. As for sentence embedding methods, a simple and fast one is to pool word embeddings. Several weighting schemes [10, 249] were adopted by pooling-based methods for simple sentence embeddings. Yet, there is an anisotropy problem in word-embedding-based pooling methods; namely, the associated embeddings are confined to a narrow cone region of the vector space [63], which limits their capability in sentence similarity evaluation. To address this limitation, post-processing techniques were proposed to alleviate the anisotropy problem. For example, principal component removal [159], BERT-flow [127], and BERT-whitening [223] can make sentence embeddings more uniformly distributed so as to enhance the performance of sentence similarity assessment. Recently, methods [70, 189] that fine-tune pre-trained models on labeled data or use self-supervised contrastive learning achieve superior performance on sentence similarity tasks.

2.3.2 Word Alignment

Alignment-based methods measure the degree of word matching for sentence similarity evaluation. WMD is a popular alignment-based method, and its extensions are widely used in text similarity tasks. For example, Sentence Mover's Similarity [40] targets the similarity measure of long, multi-sentence text sequences. It uses both word embeddings and sentence embeddings to measure text similarity. Word Rotator's Distance [287] shows that the norm of a word embedding encodes word importance while the angle between two word embeddings captures word similarity. Consequently, it assigns word flow based on the norm of word embeddings and computes the cosine distance for the similarity measure. Recursive Optimal Transport [263] is a structure-aware WMD method. It uses a binary or a dependency parse tree to partition a sentence into substructures at multiple levels. Then, text similarity is recursively calculated by applying WMD to substructures at the same level. Yet, since there is no interaction between substructures at different levels, its capability for sentence similarity measurement can be affected.
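To make the alignment view concrete, the sketch below computes a plain WMD between two short sentences as an optimal-transport problem over word embeddings. It assumes the POT package (imported as ot) for the earth mover's distance solver and uses random vectors in place of real pre-trained embeddings, so it only illustrates the mechanics of the original WMD, not the SynWMD extensions proposed later in this thesis.

```python
# Minimal Word Mover's Distance sketch using optimal transport.
# Assumes: pip install pot numpy; random vectors stand in for pre-trained embeddings.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       ["obama", "speaks", "media", "president", "greets", "press"]}

def wmd(sent1, sent2):
    # Uniform word flow; WMD variants reweight this (e.g., by IDF or by syntax).
    a = np.full(len(sent1), 1.0 / len(sent1))
    b = np.full(len(sent2), 1.0 / len(sent2))
    # Pairwise Euclidean distances between word embeddings form the cost matrix.
    cost = np.array([[np.linalg.norm(emb[w1] - emb[w2]) for w2 in sent2]
                     for w1 in sent1])
    # Earth mover's distance: minimum cumulative cost of moving word mass.
    return ot.emd2(a, b, cost)

print(wmd(["obama", "speaks", "media"], ["president", "greets", "press"]))
```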
MoverScore [297] and BERTScore [290] are two more recently developed alignment-based methods using contextual word embeddings. Built upon the same concept as WMD, MoverScore uses the Inverse Document Frequency (IDF) to assign word flow so that less frequent words get higher flow weights. Furthermore, instead of adopting static word embeddings, it uses contextual word embeddings, which incorporate a word's contextual information implicitly and enable more accurate distance measures between words. Unlike WMD, which considers the matching degree between a word in one sentence and all words in the other sentence, BERTScore uses a greedy match between words, where each word is only matched to its most similar word in the other sentence. Both MoverScore and BERTScore offer state-of-the-art performance on text generation evaluation.

2.4 Text Summarization

Text summarization is the process of creating a shorter version of a long text while retaining the most important information. There are two main approaches to text summarization: extractive summarization and abstractive summarization. Extractive summarization selects the most important sentences or phrases from the original text and concatenates them to form a summary. The sentences are chosen based on their relevance and importance to the overall meaning of the text. Abstractive summarization generates a summary that does not necessarily use the same words as the original text. Instead, it tries to capture the essence of the text by generating new sentences that convey the most important information. In the following, we introduce recent work on unsupervised extractive summarization. In addition, compressive summarization, which extracts sub-sentential units instead of entire sentences, is overviewed.

2.4.1 Unsupervised Extractive Summarization

Most unsupervised extractive summarization methods work at the sentence level. They study the relationships between sentences and select the most salient sentences as the summary. Many early summarization works [62, 150, 243] rank sentences by measuring their similarity to other sentences, where the similarity is usually computed by tf-idf [62] or content overlap [150]. More recent work has shifted to using pre-trained language models, like BERT and GPT, to learn better relationships between sentences and documents. PACSUM [298] utilizes BERT to capture sentential meaning and incorporates position bias into the centrality measurement so as to select the most salient sentences. STAS [277] ranks sentences by the attention matrices from a pre-trained hierarchical transformer. The work in [172] measures relevance and redundancy using point-wise mutual information between sentences computed with language models. FAR [136] considers the distance to the document when selecting sentences so that multiple semantic facets are captured.

2.4.2 Compressive Summarization

Compressive summarization is a text summarization technique that selects sub-sentential units by compressing the selected sentences into a summary that avoids irrelevant or redundant information. This approach helps to generate a concise summary of long texts. Pre-neural methods [17, 60, 147] for compressive summarization use rule-based algorithms that rely on linguistic analysis, such as part-of-speech tagging and syntactic parsing, to identify the most important sub-sentential units of a sentence.
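As a simple illustration of the kind of syntax-guided compression just described, the sketch below deletes prepositional-phrase subtrees from a sentence using a dependency parse. It assumes spaCy with the en_core_web_sm model, and the single deletion rule is a toy assumption rather than the rule set used by the cited methods or by the approach proposed in Chapter 6.

```python
# Toy syntax-guided sentence compression: drop prepositional-phrase subtrees.
# Assumes spaCy and en_core_web_sm are installed; the rule is illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")

def compress(sentence):
    doc = nlp(sentence)
    # Collect the indices of every token inside a subtree rooted at a preposition.
    drop = {t.i for tok in doc if tok.dep_ == "prep" for t in tok.subtree}
    return " ".join(tok.text for tok in doc if tok.i not in drop)

print(compress("The shot was fired at a dark coloured car by a white man."))
# A plausible output keeps the core clause, e.g. dropping "at a dark coloured car".
```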
Recent works [56, 149, 276, 301] in compressive summarization have explored the use of neural models, which have shown promising results. They first obtain sentence and word binary labels, i.e., delete or keep, using Oracle under syntactic constraints. Then the neural models are trained under supervision. 2.5 Language Models Language modeling studies the probability distributions over a sequence of linguistic units, such as words. It is one of the most fundamental tasks and long-standing research topics in natural language processing (NLP). The developed language models (LMs) find applications in many computational linguistic problems such as text generation, machine translation, speech recognition, natural language generation, questionand-answer systems, etc. There are two major approaches to language modeling: 1) the statistical approach based on a relatively small corpus set, and 2) the data-driven approach based on a significantly larger corpus set. Conventional language models (CLMs) predict the probability of linguistic sequences in a causal manner. They can be learned by both language modeling approaches. The data-driven approach has become mainstream nowadays. It exploits a large number of corpora to train neural-network models, leading to pre-trained language models (PLMs). PLMs are then fine-tuned with task-specific datasets and objectives for downstream applications. The goal of CLMs is to model the probability distributions over sequences of linguistic units: P(u1, u2, · · · , ut), (2.1) 18 where ui can be either a character, a word, a phrase, or other linguistic units. CLMs attempt to predict the next linguistic unit in a text sequence given its preceding contexts: P(ut |u<t) (2.2) CLMs are also called auto-regressive language models since the units are predicted in a causal way. Estimating the probability of a text sequence as shown in Eq. (2.1) directly encounters the data sparsity problem. CLMs often estimate the joint probability of the text sequence by decomposing a text sequence into smaller units. For example, CLMs leverage the chain rule and the conditional probability to estimate the joint probability in the form of P(u1, u2, · · · , ut) = P(u1)P(u2|u1)P(u3|u1, u2)· · · P(ut |u1, ...ut−1). (2.3) Before the pre-training era, CLMs are often trained from scratch with a training corpus and, then, predict the probability of text sequences with respective applications. Representative models include Ngrams LMs [28, 65, 166], exponential LMs [18, 54, 196] and earlier neural LMs [16, 153]. CLMs give a high probability to natural text sequences occurring frequently in the real world. As a result, they play a fundamental role in text generation, speech recognition [13, 92, 94], and machine translation [27, 168, 280] until the emergence of PLMs. Nowadays, high-performance PLMs serve as the backbone of many NLP systems. They are not limited to the causal predictive functionality of CLMs and provide more different types of LMs. The differences between CLMs before the pre-training era and PLMs can be summarized below. • Training Methodology. With the development of deep learning, PLMs with neural network structures are pre-trained by collections of massive unlabeled corpora to learn generic knowledge which is then transferred to downstream tasks by task-specific fine-tuning. 19 • Causality Constraint. PLMs do not necessarily follow CLMs in predicting linguistic units as shown in Eq. (2.2). 
For example, bidirectional LMs [57, 143] use both preceding and succeeding contexts to predict the missing linguistic units via probability estimation: P(ut |u<t, u>t). (2.4) Bidirectional LMs do not follow the causality constraint and the chain rule in Eq. (2.3), to access the probability of a text sequence, which makes it inherently different from CLMs. • Token Representation. Apart from the differences in the training paradigm and probability modeling, PLMs adopt a different representation for basic units called tokens. PLMs represent tokens by embedding them in a high-dimensional continuous space such as word embeddings [176, 267] and sentence embeddings [70, 249, 251]. The new representations offer a flexible and powerful tool that enables PLMs to handle a wide range of tasks. The rest of the subsection is organized as below. We introduce several types of LMs that go beyond CLMs in Sec. 2.5.1, and provide an overview of common ways to decompose text sequences into smaller linguistic units in Sec. 2.5.2. Sec. 2.5.3 introduces different model architectures. We discuss the training procedures of LMs in Sec. 2.5.4. The text generation decoding methods of LLMs are discussed in Sec. 2.5.5. Common evaluation methods including, both intrinsic and extrinsic ones, are introduced in Sec. 2.5.6. We comment on the redundancy problem of LMs and analyze techniques for efficient LMs in Sec. 2.5.7. 2.5.1 Types of Language Models CLMs commonly refer to auto-regressive models that predict the next linguistic units given the preceding context as shown in Eq. (2.2). LMs can access the probability of a text sequence using the chain rule. The 20 Figure 2.2: The example of a dependency parse tree example [156]. goal of CLMs is to decode the probability of text sequences in a causal manner. In this section, we introduce more LMs that go beyond CLMs. 2.5.1.1 Structural LM Instead of predicting linguistic units in a sequential or reversed sequential order, structural LMs [31, 32, 76, 156, 268] predict linguistic units based on pre-defined linguistic structures such as dependency or constituent parse trees. Structural LMs utilize the linguistic structure to bring linguistically relevant context closer to the linguistic unit to be predicted. For example, given a parse tree structure, a structural LM can define the ancestor context A(ut) of ut as the sequence from the root node to the parent of ut . For example, the ancestor sequence of word ‘strong’ is {‘binoculars’, ‘saw’, ROOT} in Fig. 2.2. Then, the structural LM uses the ancestor context in the tree to predict the next linguistic unit as P(ut |A(ut)), (2.5) where A(ut) is the ancestor context of linguistic unit ut . Similar to CLMs, structural LMs are designed to model the probability of text sequences. Differently, structural LMs decode the sequence probability in the order of their synthetic structures. It has been successfully applied to sentence completion [76, 156] and speech recognition [31, 32]. 21 Figure 2.3: The use of different permutations in a natural sentence. 2.5.1.2 Bidirectional LM Instead of using the causal contexts to make predictions, bidirectional LMs utilize contexts from both directions as shown in Eq. (2.4). The masked LM is one representative bidirectional LM. It masks out linguistic units in a text sequence and, then, encodes their preceding and succeeding contexts to predict the masked linguistic units. 
Formally, the prediction can be defined as the estimation of the following conditional probability P(um|S¯), (2.6) where um is the masked linguistic unit and S¯ is the corrupted text sequence by replacing a certain number of linguistic units with [MASK] symbols. The goal of bidirectional LMs is to learn the inner dependency between linguistic units in an unsupervised manner. The trained model can inherit semantics meanings from large-scale unlabeled corpora. Different from CLMs that aim to model the generation probability of text sequences, pre-trained bidirectional LMs are used as the backbone that transfers the learned knowledge through further fine-tuning in various downstream applications. 2.5.1.3 Permutation LM CLMs and masked LMs have their own advantages and disadvantages. A masked LM needs to create artificial tokens such as [mask], which never occur in downstream tasks while CLMs only condition on 22 preceding context. The permutation LM [285] is a recently proposed LM that takes advantage of CLMs and masked LMs. Given an input sequence of linguistic units, permutation LMs randomize the order of input linguistic units and construct different permutations of the input sequence. Fig. 2.3 shows an example of different permutations given an input text sequence. Let Z be the set of all possible permutations. Permutation LMs predict the next linguistic unit, ut , in one permutation, Z, of the sequence based on P(ut |u Z <t), Z ∈ Z. (2.7) 2.5.2 Linguistic Units To estimate the probability of text sequences, LMs partition text sequences into small linguistic units such as characters, words, phrases, or sentences. This process is called tokenization. The resulting linguistic units are called tokens. Different languages and models may have different appropriate tokenization methods. Here, we focus on English and use it as an example. In this section, we examine typical tokenization methods used in language modeling according to unit sizes. 2.5.2.1 Characters LMs can model text sequences probability based on characters [90, 107, 190, 227, 279]. As compared with other linguistics units, using characters has a much smaller vocabulary size, leading to a smaller discrete space and model size. On the other hand, it is challenging to predict the next character. Usually, it requires a long historical context. This makes the performance of character-level LMs poorer than that of wordlevel LMs. In addition, the input and output lengths have to be longer to model the character distribution accurately. This results in higher computational costs, especially for auto-regressive decoding. Several LM methods use the combination of words and characters to alleviate the issue [100, 157, 240]. 23 2.5.2.2 Words and Subwords The most natural tokenization for English is to decompose a text sequence into words by white spaces. Many LMs apply word tokenization. However, there are several issues of naive word tokenization. The first one is the Out-Of-Vocabulary (OOV) problem. Because an LM has a pre-defined vocabulary size that cannot be arbitrarily large. Less frequent words and words with character-level errors may not be stored in the pre-defined vocabulary. Thus, they cannot be retrieved from the dictionary. Although one can extend the vocabulary size to alleviate this problem, it will increase the model size and still cannot handle all possible words. LMs beyond the word level still have the OOV problem while a single character is not semantically meaningful by themselves. 
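The OOV issue under naive word-level tokenization can be illustrated in a few lines; the "training corpus" and the frequency cutoff below are toy assumptions.

# A small sketch of the OOV problem under naive whitespace word tokenization.
from collections import Counter

train_text = "the cat sat on the mat . the dog sat on the rug ."
counts = Counter(train_text.split())
vocab = {w for w, c in counts.items() if c >= 2}     # keep only frequent words

test_tokens = "the dogs sat on the matt .".split()   # inflection + misspelling
oov = [w for w in test_tokens if w not in vocab]
print(f"vocabulary size: {len(vocab)}, OOV tokens: {oov}")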
Recently, researchers are in favor of decomposing words into subwords if they do not appear in the dictionary. This offers a flexible and effective solution to the OOV problem [155, 215]. Several subword segmentation algorithms are developed to boost the performance of LMs. They strike a balance between the good performance of word-level models and the flexibility of character-level models. Two subword segmentation approaches, statistics-based and linguistics-based, are presented below. Statistics-based Subword Tokenizers The statistics-based subword tokenizers generate subword vocabulary purely based on the corpus. The associated methods are derived from a compression point of view. They work by replacing the commonly appeared character sequences with a new symbol (word) that does not exist in the current vocabulary. Then, fewer bytes are needed for information transmission. Byte Pair Encoding (BPE). BPE [67] is a simple data compression technique that replaces the most common pair of bytes in a sequence by a single unused byte recursively. It was adopted by [215] to solve the word segmentation problem. That is, frequent characters or character sequences are merged to generate subwords. BPE is also used by several advanced PLMs such as GPT-2 [186] and RoBERTa [143] with the following algorithm, called the BPE merge operation. 24 Figure 2.4: Illustration of the BPE merge operation conducted on the dictionary {“hug", “pug", “pun", “bun"}. The vocabulary is initialized with all characters. Then, a new subword is created by merging the most frequent pair. 1. Prepare a training corpus and define the size of the subword vocabulary. 2. Split all words into characters. 3. Generate a new subword by merging a pair of characters or subwords with the highest frequency. 4. Repeat step 3 until the desired vocabulary size is reached. An illustration of the BPE merge operation conducted on a small dictionary is given in Fig. 2.4. WordPiece. [212] WordPiece is another data-driven subword algorithm. The difference between WordPiece and BPE is that WordPiece merges the pair of A and B if they have the highest score P(AB)/P(A)P(B) (rather than the highest frequency P(AB)) at each iterative step. For example, WordPiece merges the pair of “u” and “g” in Fig. 2.4 only if they have the highest value, P( ′ug′ )/P( ′u ′ )P( ′ g ′ ), as compared with other pairs. WordPiece is used as the tokenization method in BERT [57], DistilBERT [206], and Electra [41]. There are other statistics-based subword tokenizers such as Unigram [113]. SentencePiece ∗ , Huggingface tokenizers † , and OpenNMT ‡ are popular tokenizers. Their implementation contains the statisticsbased subword tokenization. Different subword tokenizers and their performance comparison are studied in [24]. ∗ https://github.com/google/sentencepiece † https://github.com/huggingface/tokenizers ‡ https://github.com/OpenNMT/Tokenizer 25 Linguistics-based Subword Tokenizers Linguistics-based subword tokenizers exploit the linguistic knowledge and decompose words into smaller grammatical units, such as morphemes or syllables. Such subword tokenizers are widely used in machine translation and speech recognition among different languages [2, 45, 46, 104, 198, 203, 208]. For example, in machine translation, words formed by compounding, affixation, or inflection, can be conveniently translated by translating the morphemes, respectively. 
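Returning to the statistics-based side, the BPE merge operation of Fig. 2.4 can be sketched as follows. The word frequencies are assumed for illustration, and the merge step is deliberately simplified (a production tokenizer guards symbol boundaries when replacing pairs).

# A sketch of the BPE merge operation in Fig. 2.4. Words are represented as
# space-separated symbols (step 2 of the algorithm); each iteration merges the
# most frequent adjacent pair (steps 3-4). Frequencies are assumed here.
from collections import Counter

def get_pair_counts(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Simplified merge via string replacement of the chosen pair.
    merged = " ".join(pair)
    return {word.replace(merged, "".join(pair)): freq for word, freq in vocab.items()}

vocab = {"h u g": 10, "p u g": 5, "p u n": 12, "b u n": 4}
for step in range(3):
    pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(pair, vocab)
    print(f"merge {pair} -> {vocab}")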
However, linguistics-based subword tokenizers are not as popular as statistics-based ones due to the complexity and the rule-based nature of language decomposition. 2.5.2.3 Phrases The semantic meaning of a single word can be ambiguous because of various contexts and set collocations. Since the linguistic dictionary does not go beyond the word-level, the inter-word dependency is ignored. Phrase-level LMs replace common and cohesive word sequences by phrases [122, 192, 207, 225]. Phraselevel LMs are suitable for some applications. For example, it is observed in [207] that short words with fewer syllables in automatic speech recognition (ASR) are more frequently misrecognized than longer ones. Since phrases provide longer phone sequences than their constituents, they are more robust to recognition errors for ASR. 2.5.2.4 Sentences Auto-regressive LMs with smaller linguistic units (e.g., characters, words, subwords, and phrases) rely on conditional probabilities to estimate the probability of text sequences as given in Eq. (2.3). Sentence-level LMs [35, 91, 119, 194, 195] avoid the use of the chain rule. They generate sentence features and, then, model the sentence probability directly. This is because modeling the sentence probability directly is more 26 convenient than that in Eq. (2.3) in encoding the sentence-level information. It is also easier to encode the inter-sentence information such as the effects of preceding utterances in a dialog flow. 2.5.3 Architecture of Language Models In this subsection, we conduct a survey on several common architectures to model the probability distributions of text sequences. They are N-gram, maximum entropy, and neural network models. While there are other LM architectures, like Gaussian mixture LMs [3] and Hidden Markov LMs [114], we focus on the above-mentioned architectures due to their popularity in the research community. Furthermore, LMs can operate at various levels of linguistic units. For generality and consistency with most recent literature, we use the term ‘token’ to refer to all linguistic units leveraged by different LMs for the rest of this paper. 2.5.3.1 N-gram Models An N-gram consists of N consecutive tokens from a text sequence. N-gram LMs [28, 65, 166] assume that the probability of a token depends only on its preceding N-1 tokens and it is independent of other contexts. This is known as the Markov assumption. Thus, instead of using all historical contexts, N-gram LMs only use the previous N-1 tokens to predict the current one; namely, P(ut |u<t) = P(ut |ut−N+1:t−1). (2.8) N-gram LMs calculate the conditional probability by counting the occurrence time of N-grams given a training corpus as P(ut |ut−N+1:t−1) = C(ut−N+1:t) C(ut−N+1:t−1) . (2.9) 27 N-gram LMs simplify the token probability calculation based on previous N-1 tokens, but they encounter two sparsity issues. First, if an N-gram, (ut−N+1:t), never occurs in the training corpus, the probability for the next tokens being ut is zero. Second, if the (N-1)-gram, (ut−N+1:t−1), in the denominator never occurs, we cannot calculate the probability of any tokens. These sparsity issues can be alleviated by smoothing techniques. A simple smoothing method [98, 137], called additive smoothing, is to add a small value to the count for every N-gram so as to avoid zero in the numerator and the denominator in Eq. (2.9). However, this simple smoothing is still deficient because it assigns the same probability for N-grams that never occur in the training corpus. 
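A minimal sketch of the bigram case (N = 2) of Eq. (2.9), with and without additive smoothing, is given below; the corpus is a toy example, and k = 1 corresponds to Laplace smoothing.

# Bigram estimation following Eq. (2.9), plus additive (add-k) smoothing.
from collections import Counter

corpus = "the cat sat on the mat the cat lay on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                                   # vocabulary size

def p_mle(w, prev):
    # Unsmoothed Eq. (2.9): zero if the bigram never occurs in the corpus.
    return bigrams[(prev, w)] / unigrams[prev]

def p_add_k(w, prev, k=1.0):
    # Additive smoothing: add k to every bigram count.
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

print(p_mle("sat", "cat"), p_add_k("sat", "cat"))   # seen bigram
print(p_mle("dog", "cat"), p_add_k("dog", "cat"))   # unseen bigram: 0 vs. small value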
There are more advanced smoothing techniques such as back-off and interpolation [34, 39, 93, 102, 110] that achieve better probability estimation. In back-off, lower-order N-grams are used for probability estimation if higher-order N-grams do not occur. For example, if C(ut−3:t−1) = 0, we back off to compute P(ut |ut−2:t−1). In interpolation, different N-grams are considered for conditional probability computation. Mathematically, the N-gram probability is estimated by P(ut |ut−N+1:t−1) = λN P(ut |ut−N+1:t−1) + λN−1P(ut |ut−N:t−1) + λN−2P(ut |ut−N−1:t−1) + ... + λ1P(ut), (2.10) where λi is the weight for each n-gram and PN i=1 λi = 1. 2.5.3.2 Maximum Entropy Models Maximum Entropy models (also called the exponential models) [18, 54, 196] estimate the probability of text sequences using feature functions in the form of P(u|h) = exp(a T f(u, u<t)) P u′ exp(a T f(u ′ , u′ <t)), (2.11) 28 Figure 2.5: The structure of FFN LMs, where ut−N+1, ..., ut−1 denotes the preceding contexts of ut in a fixed-window, and P, H, and O are the dimensions of the projection, the hidden layer, and the output layer, respectively. where f(u, u<t) is the feature function that generates the feature of token u and its historical context u<t, P w′ exp(a T f(u ′ , u′ <t)) is a normalization factor, and a is a parameter vector derived by the Generalized Iterative Scaling algorithm [50]. The features are usually generated from the N-grams. 2.5.3.3 Feed-forward Neural Network (FNN) Models The discrete nature of the N-gram model is its performance bottleneck even with advanced smoothing techniques. Neural LMs embrace the continuous embedding space (distributed representation) to overcome the data sparsity problem. Feed-forward Neural Network (FNN) LMs [9, 16, 213, 214] is one of the earlier neural network models. An FNN LM takes historical contexts as the input, and outputs the probability distribution of tokens. As shown in Fig. 2.5, each token in the preceding context is represented as a vector through a projection layer (i.e., an embedding matrix). These vectors of tokens are sent to the hidden layer with H hidden units followed by non-linear activation. Then, a softmax function is used to obtain the posterior probabilities for 29 token candidates, denoted as P(ut = vi |ut−N−1:t−1), which represent the probabilities of token ut being vi , where vi represents the i-th token in the vocabulary, given a specific history ut−N−1:t−1 predicted by the language model. An FNN LM uses a fixed window to collect fixed-length contexts. It is essentially a neural version of N-gram LMs. The FNN LM have several advantages over the N-gram LM by projecting tokens into continuous space. First, it can handle unseen N-grams by representing each token as an N-gram with a dense vector space. Second, it is storage-efficient since it does not need to count and store the transition probability of conventional N-gram models. 2.5.3.4 Recurrent Neural Network (RNN) Models . It is clearly insufficient to use the historical context in a fixed-length to predict the next token. In contrast to the limited historical context used in the N-gram, maximum entropy and FNN LMs, Recurrent Neural Network (RNN) LMs [111, 153, 154, 226, 284] can exploit arbitrarily long histories to predict the next token. The structure of a vanilla RNN LM is shown in Fig. 2.6. A token ui in position i is first converted into a one-hot representation uˆi . 
Then, the recurrent hidden state, hi+1, is computed using the previous hidden state, hi , and the one-hot representation, uˆi , of token ui as hi+1 = f(Wuˆi + Uhi), (2.12) where f(·) is a non-linear activation function, W is the weight matrix of the connections from the input layer to the hidden layer, and U is the connection between the previous and current hidden layers, respectively. By iteratively computing the hidden states, RNN LMs can encode the historical context of varying 30 length. Finally, the output layer gives the conditional probability of tokens yt = g(V ht), where V is the weight matrix connecting the hidden layer and the output layer and g(·) is the softmax activation function. Figure 2.6: The structure of RNN LMs. In theory, RNN LMs do not need the Markov assumption. They can use all preceding history to predict the next token. However, the inherent gradient vanishing problem of RNN hampers the learning of the model [84]. Since the gradient may become very small over a long distance, model weights are actually updated by the nearby context in practice. Generally, RNN LMs cannot learn the dependency between the current token and its far-away historical context. Although an attention mechanism can be introduced to RNNs to alleviate this problem [12, 55]. The inherent sequential nature of RNNs makes them less powerful than transformer-based LMs with a self-attention mechanism. 2.5.3.5 Transformers The transformer architecture [239] can capture long-term dependencies and important sequence components by exploiting a self-attention mechanism. Unlike the recurrent structure of RNNs, a transformer is easy to parallelize in both training and inference. Its structure is shown in Fig. 2.7. It consists of an encoder and a decoder. Before being sent to the encoder, the input textual sequence is first converted to an embedding through an embedding layer plus positional embedding. Multi-head attention, which is an ensemble of multiple self-attention mechanisms, enables the transformer to capture more robust and diverse 31 Figure 2.7: The structure of a transformer [239]. attention between tokens. The other parts in the transformer encoder include feed-forward layers, residual connections, and normalization layers. The difference between the transformer encoder and decoder is that the transformer decoder has an additional masked multi-head attention layer. The masking ensures the decoder can only access preceding tokens of the current one, which makes the decoder auto-regressive. Based on different purposes, transformers have encoder-only, decoder-only, and encoder-decoder three variants as shown in Table 2.1 and Fig. 2.8. Encoder-only models can access all positions given an input and utilize bi-directional contexts to predict tokens. They are suitable for tasks requiring understanding full sentences, such as text classification. Transformer decoder-only models can only use previous tokens 32 to predict the current token (namely, auto-regressive models). They are good at text generation tasks such as story generation. Transformer encoder-decoder models can access all tokens in the encoding phase, and tokens before the current token in the decoding phase. They are suitable for sequence-to-sequence tasks such as translation and summarization. 
Encoder-only models (Bidirectional) BERT [57] RoBERTa [143] ELECTRA [41] Decoder-only models (Unidirectional) PaLM [38] GPT-1, 2 and 3 [29, 185, 186] Transformer XL [49] Encoder-Decoder models (Sequence to sequence) BART [126] T5 [187] Table 2.1: Transformer-based PLMs. Figure 2.8: Illustration of different transformer models, where BERT is the encoder-only model, GPT is the decoder-only model, and BART is the encoder-decoder model. 33 2.5.4 Pre-trained Language Models Pre-trained language models (PLMs) are dominating in the NLP field nowadays. With the development of deep learning, the training and usage of PLMs have changed a lot as compared with conventional statistical LMs. Before being applied to real-world tasks, PLMs are first pre-trained on massive collections of corpora so that they learn universal representations that carry both syntactic and semantic knowledge. After pretraining, PLMs are fine-tuned for downstream tasks so that the acquired knowledge can be transferred to different tasks. In the following, we first explain the pre-training objectives in Sec. 2.5.4.1 and then talk about how to adapt PLMs to various tasks of interest through fine-tuning in Sec. 2.5.4.2. It is also worthwhile to point out several good survey papers on PLMs, e.g., [78, 140, 183]. 2.5.4.1 Pre-training . The most commonly used pre-training task is “missing token prediction”. There are other pre-training tasks for different purposes, e.g., next-sentence prediction, which allows an LM to learn sentence relationships. Word Prediction: Auto-regressive language LMs [29, 185, 186] are trained to predict the next token using previous tokens. While bidirectional LMs [57, 118, 143] mask a subset of tokens in a sample and learn to predict such masked tokens using the rest of the context. For the latter, the most popular objective is the masked language model (MLM) objective as proposed in BERT [57]. The MLM objective is the crossentropy loss in predicting masked tokens. It randomly masks out 15% of the input tokens and then predicts the masked tokens. The number of masked tokens is set to 15% based on experimental verification. If the masking rate is too small, the model only learns from a limited number of masked tokens. On the other hand, if it is too large, there is not enough context to do reasonable predictions and models cannot learn well. 34 Other Pre-training Tasks: There are other pre-training tasks to make LMs learn better linguistic knowledge such as sentence relationships. For example, next sentence prediction is used as the pre-training task in BERT [57]. Next sentence prediction is formalized as a binary prediction task that decides whether two sentences are two consecutive sentences or not. In this way, a PLM can be used in downstream tasks that require the understanding of the relationship between two sentences, such as Question Answering (QA) and Natural Language Inference (NLI). Other pre-training objectives are adopted by BART [126]. They include token deletion, text infilling, sentence permutation, and document rotation to corrupt the original sequence for reconstruction. Shuffled tokens are used in T5 [187] to increase the robustness of the learned representation. 2.5.4.2 Fine-Tuning, Adapter Tuning and Prompt Tuning PLMs learn non-task-specific language knowledge in the pre-training stage. Fine-tuning performs taskspecific adaptations of the model so that they can be applied to different downstream tasks. The model parameters are updated in the fine-tuning stage. 
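One common recipe, elaborated in the next paragraph, attaches a small task-specific head on top of the pre-trained encoder and updates the parameters with a task loss. The sketch below only illustrates the idea: the frozen random "encoder" is a stand-in for a PLM rather than any particular model, the two-class setup mimics a sentiment task, and real fine-tuning would also back-propagate into the encoder.

# A minimal sketch of task-specific fine-tuning: a small classification head is
# trained on top of frozen encoder representations (the encoder is a random
# stand-in for a PLM, kept fixed here for simplicity).
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, NUM_CLASSES = 16, 2
EMBED_TABLE = rng.standard_normal((1000, HIDDEN)) * 0.1

def encoder(token_ids):
    # Mean-pool fixed token vectors into one sentence vector.
    return EMBED_TABLE[token_ids].mean(axis=0)

# Toy labeled data: token-id lists with binary sentiment labels.
data = [([12, 7, 99], 1), ([3, 44, 5], 0), ([12, 44, 8], 1), ([3, 7, 2], 0)]
X = np.stack([encoder(ids) for ids, _ in data])
y = np.array([label for _, label in data])

W = np.zeros((HIDDEN, NUM_CLASSES))        # the task-specific head to be learned
for _ in range(200):                       # plain gradient descent on cross-entropy
    logits = X @ W
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    W -= 0.5 * X.T @ (probs - np.eye(NUM_CLASSES)[y]) / len(y)
print("training accuracy:", (np.argmax(X @ W, axis=1) == y).mean())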
One approach is to design task-specific heads based on different label spaces and losses in different downstream tasks, then update the entire model and taskspecific heads. For instance, GPT [185] and BERT [57] added an extra linear output layer as task-specific heads in their original papers, and fine-tuned the entire set of parameters in the PLMs and the heads for various downstream tasks, such as natural language inference, question answering, semantic similarity, and text classification. To make the fine-tuning mechanism more parameter efficient, one can choose to only update certain layers of an LM and the task-specific heads. Adapter tuning [86, 88, 178] is proposed to make fine-tuning even more parameter efficient compared with updating the last layers of a PLM only. It injects additional compact layers, calls adapters, into the original PLMs. Then, the new adapter layers are updated, while the parameters of the original PLMs are 35 frozen during adapter tuning. In this way, the parameters of the original PLMs can be shared by different downstream tasks. PLMs are pre-trained by one or several pre-training objectives and, then, applied to different downstream tasks. The gap between pre-training tasks and downstream task-specific fine-tuning can be substantial. Prompt-tuning [140] is used to discover the potential of PLMs by mimicking the pre-training objectives in the fine-tuning or inference stage. As PLMs get more powerful, they can handle various downstream tasks by seeing a few examples without any gradient updates or fine-tuning. This is achieved by prompt-based fine-tuning (or prompt-tuning in short). The prompt can be divided into discrete prompts (also called hard prompts) and continuous prompts (also called soft prompts). A discrete prompt is a natural text template that could be manually designed by humans [29, 210, 211] or automatic methods [68, 179, 302]. On the contrary, continuous prompts [121, 134, 182, 300] are continuous vectors in the embedding space that do not correspond to real text. It sacrifices interpretability but relaxes the discrete prompt constraint in that prompts should be real texts. Fig. 2.9 shows an example of the pre-training task, fine-tuning and discrete prompt-tuning of MLMs. In the pre-training, MLMs are trained to predict masked tokens. Assuming that the downstream task is the sentiment analysis of the movie review. In standard fine-tuning, we train a new head on the top of a PLM and predict the sentiment labels. The original input appended with a designed prompt, say, ‘It was’, is sent to the PLM. The PLM has to assign probabilities to designed answers, which can be ‘great’ or ‘terrible’. If the probability of ‘great’ is higher, then the label of the input will be positive and vice versa. In this way, prompt-tuning converts a distinct downstream task to the token prediction task to narrow the gap between the pre-training and fine-tuning stages. 36 Figure 2.9: An illustration of (a) LM pre-training, (b) standard fine-tuning, and (c) prompt-based fine-tuning (or prompt-tuning) [69]. 2.5.5 Decoding Methods Decoding decides the next output linguistic unit to generate text. A good decoding method should generate coherent continuation given a context. As LMs get more sophisticated, decoding methods have played an increasingly important role. As shown in Fig. 2.10, deficient decoding methods lead to bad generated texts even with a powerful LM. There are two main decoding methods for text generation. 
Figure 2.10: Comparison of texts generated by the powerful GPT-2 large language model (LLM) using Beam search (left) and pure sampling decoding (right). Beam search yields degenerate repetition (in blue) while pure sampling results in incoherent gibberish (in red) [85]. Maximization-based decoding. This is the most commonly used decoding objective. Assuming that the model assigns a higher probability to a higher quality text which is closer to the ground truth written by humans, the maximization-based decoding strategy searches for tokens with the highest probability as 37 the generated text. Greedy search [278, 296] chooses the token with the highest probability as the next token in a greedy manner. Beam search [115, 130, 241] keeps a certain number of most likely tokens at each time step and selects the generated token sequences with the overall highest probability eventually. It avoids missing reasonable tokens that do not have the highest probability. Trainable decoding algorithms have been proposed recently. Trainable greedy decoding [74] is a neural-based solution that works as part of a neural machine translation decoder. It utilizes reinforcement learning to find a translation that maximizes a decoding objective. Sampling-based decoding. It chooses the next token from a set of sampled tokens. Because maximizationbased decoding depends highly on the underlying model probabilities and suffers from producing degenerate repetition, sampling-based decoding increases the diversity of generated texts by random sampling. However, the simple pure sampling may choose a token with low probability (from an unreliable tail distribution) as the next generated token. As a result, the generated text could be unrelated to the prefix, leading to incoherent gibberish. Top-k sampling [64] and Nucleus sampling [85] have recently been proposed to address this problem. Both Top-k sampling and Nucleus sampling sample from truncated LM distributions (i.e., sampling from the most probable tokens). Diverse Beam search [130] is a trainable sampling-based (stochastic) decoding algorithm based on the Beam search. It uses reinforcement learning to determine the beam diversity parameters for different inputs or tasks. 2.5.6 Model Evaluation There are two LM evaluation types: intrinsic evaluation and extrinsic evaluation. The intrinsic evaluation examines the internal properties of an LM while the extrinsic evaluation studies its performance in downstream tasks. 38 2.5.6.1 Intrinsic Evaluation Auto-regressive LM. LMs estimate the probability of text sequences. A good LM assigns higher probabilities to natural text sequences and lower ones to unreal or random text sequences. The perplexity is a common evaluation metric for this purpose. Given a testing text sequence, the perplexity, denoted by P P L, is defined as the inverse probability of the sequence normalized by the number of tokens. Mathematically, we have P P L(S) = N s 1 (P(u1u2...uN ) , (2.13) where S = u1u2...uN is a testing text sequence. The perplexity can be rewritten in form of P P L(S) = N vuutY N i=1 1 P(ui |u1...ui−1) . (2.14) A good LM should maximize the text set probability. It is equivalent to minimizing the perplexity. The lower the perplexity, the better the LM. Bidirectional Language Model. To calculate the inverse probability in Eq. (2.13), the auto-regressive LMs can use a sequence of conditional probabilities. However, this approach does not work for bidirectional LMs (or masked LMs). 
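For the auto-regressive case, Eq. (2.14) is typically evaluated in log space for numerical stability. The sketch below uses assumed per-token conditional probabilities standing in for the outputs of an LM.

# A sketch of Eq. (2.14): perplexity from per-token conditional probabilities
# P(u_i | u_1 ... u_{i-1}). The probabilities below are assumed numbers.
import math

def perplexity(token_probs):
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)

fluent    = [0.30, 0.45, 0.60, 0.25, 0.50]   # a natural-looking sequence
gibberish = [0.02, 0.01, 0.05, 0.03, 0.01]   # an implausible sequence
print(perplexity(fluent), perplexity(gibberish))   # lower PPL indicates a better fit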
Several intrinsic evaluation metrics have been proposed for bidirectional LMs. The pseudo-log-likelihood score (PLL) [244] is defined as

PLL(S) = \sum_{i=1}^{|S|} \log P(u_i \mid S_{\setminus i}),   (2.15)

where \log P(u_i \mid S_{\setminus i}) is the log conditional probability of token u_i in sentence S given all remaining tokens. Instead of maximizing the joint probability of the entire text sequence, a good bidirectional LM should maximize the probability of each token in the text sequence given the other tokens. Based on PLLs, the pseudo-perplexity (PPPL) of a corpus C is defined as [204]

PPPL(C) = \exp\!\left(-\frac{1}{N} \sum_{S \in C} PLL(S)\right),   (2.16)

where N is the number of tokens in C. Both PLL and PPPL provide effective means to measure the naturalness of sentences for a bidirectional LM. For example, it was shown in [204] that PLL and PPPL correlate well with the performance of an LM on downstream tasks, such as automatic speech recognition and machine translation.

2.5.6.2 Extrinsic Evaluation

Any downstream task of LMs can be used for extrinsic evaluation. There are several common downstream tasks selected as extrinsic evaluation benchmarks. Two popular ones are GLUE (General Language Understanding Evaluation) [246] and SuperGLUE [245]. GLUE is an evaluation benchmark for natural language understanding. It contains single-sentence tasks, similarity and paraphrase tasks, and inference tasks. SuperGLUE is an enhanced version of GLUE. It includes a new set of more challenging language understanding tasks, more diverse task formats, improved resources, and a public leaderboard.

2.5.6.3 Relation between Intrinsic and Extrinsic Evaluations

If an LM achieves a lower perplexity, does that mean it can also perform well on downstream tasks? In other words, is there any correlation between pre-training tasks (based on token prediction) and downstream tasks? There are many empirical studies on this question but few theoretical studies.

Empirical Studies. Researchers design experiments to understand what kind of knowledge is learned by an LM from the pre-training tasks. Examples include [71, 82, 106, 193, 230, 231]. They use part-of-speech tagging, constituent labeling, and dependency labeling to measure the degree of syntactic knowledge learning, and named entity labeling, semantic role labeling, and semantic proto-roles for testing semantic knowledge. Empirical studies show that pre-training tasks help LMs learn linguistic knowledge such as grammar [106] and semantic roles [193]. However, these experimental results can only be used as evidence supporting that token prediction tasks benefit downstream tasks. They cannot explain the underlying mechanism.

Theoretical Studies. Some researchers attempt to build the connection between an LM's perplexity and its performance on downstream tasks mathematically. Text classification tasks were studied in [209]. The authors first hypothesized and verified that text classification tasks can be reformulated as sentence completion tasks. Since the LM pre-training task is essentially a sentence completion task, it does help the text classification downstream task. Then, they quantified the connection mathematically and showed that features from LMs that are ε-optimal in log-perplexity can linearly solve text classification tasks with O(√ε) error. An underlying generative model was utilized in [271] to show the relationship between the pre-training tasks and the downstream tasks.
Current theoretical studies are limited in the sense that only a specific downstream task (say, the text classification task) is considered and the proof holds under certain conditions. 2.5.6.4 Beyond Single Metric for LM Evaluation Except for the evaluation of LM’s performance on standard evaluation test sets, the LM performance on other aspects is also important in real-world applications, such as efficiency [15, 221, 232, 272], bias [1, 19, 148, 161], robustness [72, 97, 165, 170, 202, 258, 288], explainability [303], and logical consistency [191]. 41 In this section, we discuss evaluations on efficiency, bias, and robustness to provide a holistic review of evaluation aspects. Efficiency of LMs can be evaluated in several aspects, such as inference time, computational complexity, energy consumption, model size, and training data size. Some work [221, 232, 261, 272] calculated the computational complexity, approximate financial, and environmental costs of training PLMs. They also suggested practical steps to reduce expenses in NLP research and applications. Discussion on the model size of recently developed PLMs was given in [15]. In Sec. 2.5.7 of this paper, we also discussed several methods to achieve efficient LMs. Table 2.2 shows the number of parameters, training data, cost, and time of recently developed LMs. Bias in NLP refers to systematic prejudices of models resulting from erroneous assumptions, such as racism, sexism, and ableism. Bias is reflected in PLMs since they are trained on a large volume of real word data. Several studies have examined bias in PLMs. The Sentence Encoder Association Test (SEAT) was proposed in [148] to investigate bias in BERT [57]. A dataset was created in [161] to measure bias against gender, profession, race, and religion across multiple PLMs, including BERT [57], RoBERTa [143], XLNet [285] and GPT-2 [186]. It was demonstrated in [1] that GPT-3 [29] consistently exhibits a significant anti-Muslim bias in various tasks. The work in [19] surveyed 146 papers on bias in NLP and made recommendations for analyzing bias in NLP systems. Robustness of LMs refers to their capacity to perform effectively and consistently when confronted with input variations (e.g., typos and misspellings) that should not affect the system’s output. In other words, a robust LM should not be easily fooled by adversarial text. Recent studies[97, 165, 288] created a set of character or word level perturbations to simulate various types of noise that LMs may encounter in realworld scenarios. They examined robustness of recently developed PLMs, including BERT, RoBERTa and XLNets. The results suggest that input perturbations, even minor alterations, can harm the performance 42 of these LMs. In addition, Robustness Gym [72], WildNLP [202], and TextFlint [258] are tools designed for robustness evaluation. 2.5.7 Efficient Models As recent PLMs get more powerful, their model size, training cost, and demand for training data increase tremendously. They need high computational resources and energy consumption, limiting their real-world applications. Table 2.2 shows the model size, training data, cost, and time of recently developed LMs. This issue is a concern to many people and the construction of efficient LMs has received attention. 
Table 2.2: The number of parameters, training data, training cost, and training time of several large LMs, where blank cells indicate that the data are not available. Sources are cited when the data are not obtained from the original work.

Model | Year | Number of Parameters | Training data | Training cost | Training time
BERT-Large | 2018 | 340M | 3.3B words | $7,000 § | 64 TPU chips, 4 days
XLNet-Large | 2019 | 340M | 32.9B tokens | $245,000 § | 512 TPU v3 chips, 5.5 days
GPT-2 | 2019 | 1.5B | 8 million web pages | $12,902–$43,008 [221] | 32 TPU v3 chips, 168 hours
Megatron-LM | 2019 | 8.3B | 174 GB of text data | | 512 GPUs, 2 days per epoch
T5 | 2019 | 11B | 745 GB of text data | Over $1.3 million [216] |
Turing-NLG | 2020 | 17B | | |
GPT-3 | 2020 | 175B | 570 GB of text data | Over $4.6 million ¶ | 1024 A100 GPUs, 34 days [163]
Megatron-Turing NLG | 2022 | 530B | 270B tokens | | 2K A100 GPUs, 3 months ∥

§ https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
¶ https://lambdalabs.com/blog/demystifying-gpt-3
∥ https://www.deepspeed.ai/

2.5.7.1 Data Usage

Pre-training Data Size. A critical question for PLM training is how much data is needed. The effect of the pre-training data size on the RoBERTa model was studied in [294]. The learning curves of four model performance measures as a function of the pre-training dataset size are shown in Fig. 2.11. When the data size ranges between 100M and 1B words, three learning curves gradually level off, which implies that LMs encode most syntactic and semantic features. However, a much larger quantity of data is needed for LMs to acquire enough common-sense knowledge and other skills to achieve better performance on downstream NLU tasks.

Figure 2.11: The performance curves as functions of the pre-training dataset size, where classifier probing measures the quality of the syntactic and semantic features, minimum description length probing quantifies the accessibility of these features, the BLiMP curve measures the model's knowledge of various syntactic phenomena, and SuperGLUE measures the capability of handling NLU tasks [293].

Efficient Pre-Training. Several methods have been proposed to use the pre-training data more efficiently. In the pre-training of masked LMs, a certain percentage of tokens are masked and need to be inferred from context. This approach incurs a substantial amount of computational cost because the network only learns from the small percentage of tokens that are masked. To enhance training efficiency, the work in [41] uses "replaced token detection" (rather than "masked token prediction") as the pre-training task. As shown in Fig. 2.12, a generator is trained to perform the masked LM task and predict the masked tokens. Then, the main model works as a discriminator, called ELECTRA, which learns to decide whether each token is original or replaced. In this way, pre-training is conducted on all tokens instead of a small subset of masked tokens. Learning from all input positions causes ELECTRA to train much faster than BERT, which adopts masked token prediction. Besides, ELECTRA achieves higher accuracy on downstream tasks when it is fully trained. Later, a new pre-training task using an energy-based model, which is closely related to ELECTRA, was proposed in [42].

Figure 2.12: The structure of ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [41].

Bridging Pre-training and Downstream Tasks. A typical pre-training task is token prediction, which often has a large gap with downstream tasks.
To mitigate the gap between pre-training and downstream tasks, prompt tuning has been studied in [68, 186, 210, 218]. As illustrated in Fig. 2.9, the head is trained to predict the masked tokens in masked LMs. For the downstream sentiment analysis task, the head is trained to predict the positive or the negative label in traditional fine-tuning. A template (e.g., ‘It was’) and its expected text responses (e.g., ‘great’ and ‘terrible’) are used in prompt tuning. In this way, pre-training and prompt tuning share the same “token prediction" objective. 2.5.7.2 Model Size Besides improving training efficiency, efficient LMs focus on the design of models of smaller sizes. Many methods are investigated to reduce the model size so that the model can be implemented on mobile or edge devices with limited computing resources. Model compression is a widely studied topic. Compression methods first train a large LM and then compress it into a target size. Examples include model pruning [77, 242, 264], knowledge distillation [96, 206, 236], low rank matrix approximation [87, 144, 284], and parameter sharing [48, 53, 118, 197]. 45 2.5.7.3 Inference latency Inference efficiency is important to an LM, particularly in real-time applications. A model of a smaller size generally has faster inference speed under the same setting. Knowledge distillation, pruning, and low rank matrix approximation can be employed to achieve faster inference time while reducing the model size. For instance, DistilBERT [206], which is a distilled version of BERT, has demonstrated a 60% improvement in the inference speed compared to the original model. More than 2x speed-up in inference is achieved in [264] by pruning PLMs. Fast inference speed can also be achieved by fast decoding methods. Non-autoregressive generation (NAG) models [73, 132, 224] predict each token simultaneously. They have a faster inference speed than autoregressive models due to parallel computation. On the other hand, the performance of NAG models is generally worse than autoregressive models since they do not consider the forward or backward dependency between tokens in the output text. 46 Chapter 3 Green Syntactic Structure Construction 3.1 Introduction Part of speech (POS) tagging is one of the basic sequence labeling tasks. It aims to tag every word of a sentence with its part of speech attribute. As POS offers a fundamental syntactic attribute of words, POS tagging is useful for many downstream tasks, such as speech recognition, syntactic parsing, and machine translation. POS tagging is a crucial preliminary step in building interpretable NLP models. POS tagging has been successfully solved with complex sequence-to-sequence models based on deep learning (DL) technology, such as LSTM [21, 255, 256] and Transformers [129]. However, DL models demand higher computational and storage costs. There is a need for lightweight high-performance POS taggers to offer efficiency while ensuring efficacy for downstream tasks. We propose a novel word-embeddingbased POS tagger and name it GWPT to meet this demand. Following the green learning (GL) methodology [116], GWPT contains three cascaded modules: 1) representation learning, 2) feature learning, and 3) decision learning. The last two modules of GWPT adopt the standard procedures, i.e., the discriminant feature test (DFT) [282] for feature selection and the XGBoost classifier in making POS prediction. The main novelty of this work lies in the representation learning module of GWPT. 
GWPT derives the representation of a word from its embedding. Both non-contextual embeddings (e.g., fastText) and contextual embeddings (e.g., BERT) can be used. GWPT partitions the embedding dimension indices into three sets: low-, medium-, and high-frequency. It discards the dimension indices in the low-frequency set and builds N-gram representations for the dimension indices in the medium- and high-frequency sets. Extensive experiments are conducted for performance benchmarking between GWPT and several DL-based POS taggers. The results show that, compared with DL-based POS taggers, GWPT offers highly competitive tagging accuracy with fewer model parameters and significantly lower complexity in training and inference.

The rest of this chapter is organized as follows. The GWPT method is described in Sec. 3.2. The experimental results are presented in Sec. 3.3. Concluding remarks are given in Sec. 3.4.

3.2 Proposed GWPT Method

The system diagram of GWPT is depicted in Fig. 3.1. It contains four steps. Steps 1 and 2 belong to the unsupervised representation learning module. Steps 3 and 4 correspond to the supervised feature learning and the supervised decision learning modules, respectively.

1. Frequency Analysis of Embedding Dimensions. We analyze the frequency of each word embedding dimension and partition the word embedding dimension indices into low-, mid-, and high-frequency sets.

2. Concise Representation with Adaptive N-grams. We apply adaptive N-grams to each word embedding dimension based on the frequency analysis. The red block in Fig. 3.1 shows the N-gram ranges associated with word embedding dimensions of different frequencies. The adaptive N-gram design captures the essential contextual information for accurate POS prediction.

3. Discriminant Feature Selection. The dimension of the concatenated N-grams of a word is still large. We adopt a supervised feature selection tool, DFT [282], to select features of higher discriminant power.

4. Classification for POS Labels. We perform the word-based POS classification task using an XGBoost classifier.

These four steps are elaborated below.

Figure 3.1: The system diagram of the GWPT method.

3.2.1 Frequency Analysis of Embedding Dimensions

Consider an L-dimensional word embedding scheme, which can be contextual or non-contextual. We denote each dimension by D_l, l ∈ {1, · · · , L}, and define its frequency attribute as follows. Given a sentence of M words, we use the embedding of each word to construct a matrix, W, of L rows and M columns, whose vertical direction records embedding values and whose horizontal direction is ordered by the word sequence. Let w_{l,m} be the (l, m)-th element of W. A row of matrix W indicates the variation of the values of a specific dimension along the sentence. By removing its mean, \bar{w}_l = \frac{1}{M}\sum_{m=1}^{M} w_{l,m}, we obtain a zero-mean sequence x_l, where x_{l,m} = w_{l,m} − \bar{w}_l. For dimension D_l, we use the normalized sign-change ratio (NSR) of x_l as its frequency attribute, which can be written as

NSR(x_l) = \frac{1}{M-1} \sum_{m=1}^{M-1} \delta_{m,m+1},   (3.1)

where \delta_{m,m+1} = 0 if x_{l,m} and x_{l,m+1} are of the same sign, and \delta_{m,m+1} = 1 otherwise. Clearly, the NSR of a dimension takes a value between 0 and 1. Finally, we consider all sentences from a corpus, take the average of their NSR values, and assign the averaged NSR to each dimension as its frequency. A dimension of higher (or lower) frequency indicates that the signal x_l fluctuates more (or less) frequently with respect to its mean value.
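A minimal sketch of this frequency attribute is given below; random matrices stand in for the word embeddings of three sentences, and each sentence may have a different length M.

# Per-dimension normalized sign-change ratio (NSR) of the L x M embedding
# matrix W of one sentence, averaged over sentences. Random stand-in data.
import numpy as np

def sentence_nsr(W):
    X = W - W.mean(axis=1, keepdims=True)            # remove the per-dimension mean
    sign_change = np.signbit(X[:, :-1]) != np.signbit(X[:, 1:])
    return sign_change.mean(axis=1)                  # Eq. (3.1), one value per dimension

rng = np.random.default_rng(0)
L = 8                                                # embedding dimension
sentences = [rng.standard_normal((L, m)) for m in (5, 9, 7)]   # three toy sentences
corpus_nsr = np.mean([sentence_nsr(W) for W in sentences], axis=0)
print(corpus_nsr)   # a higher value indicates a higher-frequency dimension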
Figure 3.2: We plot the averaged normalized sign-change ratio (NSR) as a function of the sorted embedding dimension index from the smallest value (l = 1) to the largest value l = 768) against the Penn Treebank dataset using the BERT word embedding. We partition dimension indices into low-, mid-, and high-frequency sets using two elbow points with l = 50 and l = 751. We plot the averaged NSR value of sorted embedding dimension indices against the Penn Treebank dataset using the BERT word embedding in Fig. 3.2. The dimension indices can be partitioned into low-, mid-, and high-frequency sets using two elbow points. They are denoted by Sl , Sm, and Sh, respectively. 3.2.2 Concise Representation with Adaptive N-grams We obtain the unsupervised features of a word as follows. 50 • Low-frequency dimensions We examined the POS of neighboring words and observed that 92.5% and 92.7% of neighboring words had different POS labels in the training sets of Penn Treebank [146] and Universal Dependencies [167], respectively. Since POS class labels change between neighboring words in a sentence, low-frequency embedding dimensions are not relevant to POS prediction. Thus, their values are discarded. • Mid-frequency dimensions The change rates of mid-frequency dimensions are higher, making them valuable for POS prediction and should be included in the representation vector. The 1- and 2-grams are used for contextual and non-contextual word embeddings, respectively, since contextual word embeddings contain the contextual information. Additionally, we apply Principal Component Analysis (PCA) with an energy threshold of 99% to filter out components corresponding to very small eigenvalues. • High-frequency dimensions The contextual information of a high-frequency dimension across multiple words proves to be useful for POS prediction. This is particularly valid for non-contextual word embedding methods (e.g., the same word “love” can be a verb or a noun depending on its context). It is beneficial to use Ngrams with a larger N value. Since the number of high-frequency dimensions is small, the cost is manageable. Additionally, we apply PCA to concatenated N-gram high-frequency dimensions for dimension reduction. Finally, we concatenate the N-grams from mid- and high-frequency dimensions to get a concise representation vector of a word. 51 3.2.3 Discriminant Feature Selection The dimension of the concise representation vector of a word from the previous step is still large. Since not all dimensions are equally important, it is desired to select a discriminant subset for two purposes. First, it can avoid the negative effects from noise or irrelevant features. Second, it can reduce the computational cost. For discriminant feature selection, we adopt a supervised feature selection method known as the discriminant feature test (DFT) [282]. DFT measures the discriminant power of each dimension of the input vector independently. For each dimension, DFT partitions its full range into two non-overlapping sub-intervals and uses the class labels of training samples to compute the weighted entropy from the two sub-intervals, called the loss function. DFT searches over a set of uniformly spaced points and finds the optimal point that minimizes the loss function. Then, the minimized loss function value is assigned to the feature as its DFT loss value. The smaller the DFT loss, the more discriminant the associated feature. Here, we use DFT to select the most discriminant subset of dimensions as features for POS prediction. 
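A simplified sketch of the per-dimension DFT search described above is given below. The number of uniformly spaced candidate points is an assumption for illustration, and the weighted-entropy loss follows the textual description rather than the reference implementation of [282].

# For one feature dimension: try uniformly spaced split points, compute the
# label entropy of the two sub-intervals weighted by their sample counts, and
# keep the minimum as the DFT loss (smaller loss = more discriminant feature).
import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def dft_loss(feature, labels, num_points=32):
    lo, hi = feature.min(), feature.max()
    best = np.inf
    for t in np.linspace(lo, hi, num_points + 2)[1:-1]:   # interior candidate points
        left, right = labels[feature <= t], labels[feature > t]
        loss = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        best = min(best, loss)
    return best

rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=500)                   # toy 3-class POS-like labels
good = y + 0.3 * rng.standard_normal(500)          # feature correlated with the labels
noise = rng.standard_normal(500)                   # irrelevant feature
print(dft_loss(good, y), dft_loss(noise, y))       # the good feature has a lower loss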
3.2.4 Classification for POS Labels After we get the discriminant features for each word, we train an XGBoost classifier [36] as the target classifier since it provides good performance and a relatively low inference complexity as compared with other classifiers. 3.3 Experiments 3.3.1 Datasets and Experimental Setup Datasets. We conduct experiments on two popular English POS tagging datasets: Penn Treebank (PTB) [146] and Universal Dependencies (UD) [167]. PTB contains material collected from the Wall Street Journal 52 (WSJ) with 45 POS tags. We adopt the common split of this dataset: Sections 0-18 (38,219 sentences) for training, Sections 19-21 (5,527 sentences) for development, and Sections 22-24 (5,462 sentences) for testing. UD consists of 183 treebanks over 104 languages. Its English UPOS (universal part-of-speech tags) has 17 POS tags. The default data split is used in our experiments. Experimental Setup. We consider non-contextual and contextual word embeddings with two representative examples. FastText [152] is a non-contextual word embedding scheme. The 300-dimensional FastText pre-trained on Wikipedia 2017 is used. Fasttext utilizes subword tokenization to address the Out-of-Vocabulary challenge, which is a serious issue in POS tagging. BERT [103] is a contextual word embedding scheme. We take the mean of embeddings of all layers as the final one. Both fastText and BERT embeddings employ subword tokenization. In our experiments, we utilize the mean pooling of subword embeddings as the embedding for the associated word. Table 3.1 lists the index ranges of mid- and high-frequency dimensions and their N-grams. We choose a smaller N for BERT, namely, the 1-gram for mid-frequency dimensions and the 1-gram and 2-gram for high-frequency dimensions. Since fastText is a non-contextual embedding, we compensate it with more gram types. We use DFT to choose 500 and 700 most discriminative features for fastText and BERT embeddings, respectively. Based on the validation sets, the XGBoost classifier has the maximum depth equal to 3, and it has 5000 trees and 4000 trees for fastText and BERT, respectively. Table 3.1: Frequency partitioning and N-gram’s choices . Word Embed. Frequency Indices N-grams FastText Low [0, 5] None Mid- [6, 260] 1,2 High- [261, 300] 1,2,3 BERT Low [0, 50] None Mid- [51, 750] 1 High- [751, 768] 1,2 53 Table 3.2: POS tagging accuracy on UD’s test dataset. Embeddings Fasttext BERT MultiBPEmb 94.30 96.10 GWPT (ours) 94.94 96.77 Table 3.3: Comparison of model sizes and inference FLOP numbers of MultiBPEmb and GWPT. Methods Modules Param. # FLOPs MultiBPEmb LSTM Layers 3,332 K (1.55X) 6,382 K (7.40X) GWPT Adaptive N-gram 281 K 522 K XGBoost 1,870 K 340 K Total 2,151 K (1X) 862 K (1X) 3.3.2 Comparison with MultiBPEmb We first compare the tagging accuracy of GWPT with another word embedding-based tagger, MultiBPEmb [80], on the UD dataset in Table 3.2. MultiBPEmb uses two Bi-LSTM layers and two Meta-LSTM layers with 256 hidden variables as the POS classifier. GWPT outperforms MultiBPEmb in prediction accuracy with both fastText and BERT embeddings. Next, we compare the model sizes and the computational complexity of GWPT and MultiBPEmb in Table 3.3, where the inference FLOPs (Floating-Point Operations) per word are used as the indicator of computational complexity. Since MultiBPEmb and GWPT use the same word embeddings (i.e., fastText or BERT), we do not include the cost of word embeddings in the table. 
Table 3.3 shows that GWPT has a smaller model size and lower inference computational complexity than MultiBPEmb. The estimated model size and inference FLOPs for GWPT using the fastText embedding on the UD dataset are given below. The main components that contribute to the model size and computational complexity in GWPT's inference are adaptive N-grams and XGBoost. Other components, such as frequency partitioning and discriminant feature selection, have negligible parameter counts and computational complexity.

• Adaptive N-grams. PCA is applied to N-grams. Since the mid-frequency range (from indices 6 to 260) encompasses 255 dimensions and involves 2-gram features, the parameter count for PCA is less than (255 × 2)² = 260,100. The high-frequency range (from indices 261 to 300) contains 40 dimensions and involves both 2-grams and 3-grams, and the parameter number for PCA is less than (40 × 2)² + (40 × 3)² = 20,800. Thus, the total is bounded by 280,900. The FLOPs for a PCA transform are 2 × m × n, where m and n are the input and output dimensions, respectively. Thus, the FLOPs are 2 × (490 × 490 + 80 × 80 + 120 × 120) = 521,800.

• XGBoost. A tree with a depth of 3 has 22 parameters. In multiclass classification problems, XGBoost employs the One-versus-Rest strategy. We use validation sets to select the tree number for each class in XGBoost. Fig. 3.3 illustrates the relationship between the validation error rate and the number of trees for each class in XGBoost. The error rates of the first 500 trees are excluded for better visualization. The validation error rates converge at 5,000 and 4,000 trees per class for fastText and BERT embeddings, respectively. The UD dataset has 17 POS classes. Thus, the total number of parameters for fastText is 5,000 × 22 × 17 = 1,870,000. The FLOPs for an XGBoost classifier in each class are the number of trees times the tree depth. All trees' predictions need to be summed up via addition. Thus, the FLOPs are (5,000 × 3 + 5,000) × 17 = 340,000.

Figure 3.3: The validation error rate as a function of the XGBoost tree numbers for each class on the UD datasets: (left) fastText and (right) BERT.

3.3.3 Comparison with Other POS Taggers

We further compare the performance with other POS taggers for PTB and UD in Table 3.4. Meta-BiLSTM [22], Char Bi-LSTM [138] and Adversarial Bi-LSTM [286] are LSTM models built on character- and word-based representations. BiLSTM-LAN [47] is a multi-layered BiLSTM-softmax sequence labeler with an attention mechanism. Flair embeddings [7] adopts character embeddings. In addition, we fine-tune the whole BERT model with extra linear layers for POS tagging, which is denoted as BERT-MLP. We see that our method can still achieve competitive performance without character-level information and complicated training strategies.

Table 3.4: Comparison of POS tagging accuracy rates for the PTB and UD test datasets, where [†] denotes a method implemented by ourselves.
Methods                  PTB     UD
Meta BiLSTM              97.96   -
Flair embeddings         97.85   -
Char Bi-LSTM             97.78   -
BiLSTM-LAN               97.65   95.59
Adversarial Bi-LSTM      97.58   95.82
BERT-MLP†                97.67   96.32
GWPT/BERT (Ours)         97.73   96.77

Table 3.5: POS tagging accuracy using different N-grams for the UD dataset.
Word Embed.   N-grams           Feature Dim.   Accuracy
FastText      1                 300            88.56
              1, 2              1.5K           94.52
              1, 2, 3           4K             94.82
              Adaptive (ours)   2K             94.80
BERT          1                 0.7K           96.64
              1, 2              3.5K           96.72
              1, 2, 3           9.6K           96.64
              Adaptive (ours)   0.7K           96.72

3.3.4 Ablation Study

We conduct ablation studies to illustrate the effects of adaptive N-grams and DFT.

Adaptive N-grams.
We compare the performance of two settings in Table 3.5: 1) fixed N-grams for all dimensions of word embeddings, and 2) the proposed adaptive N-grams. FastText achieves its best performance using up to 3-grams. BERT embeddings require only 2-grams to boost the performance due to their inherent contextual information. Increasing the neighboring context further, e.g., to 3-grams, adversely impacts the results. Our adaptive N-grams achieve similar performance but with significantly reduced feature dimensions.

DFT. Fig. 3.4 shows the curves of sorted discriminability (i.e., cross-entropy) for each feature dimension of the word representation derived from fastText for the UD dataset. Within the same figure, we depict the validation and test accuracies for POS tagging using all the features selected by DFT up to the dimension index on the x-axis. We see that the classification performance is consistent with the feature discriminability. Furthermore, we compare the performance of using the original adaptive N-gram features and the discriminative features selected by DFT in Table 3.6. It shows that the POS tagging accuracy can be further improved by removing irrelevant or noisy features using DFT.

Figure 3.4: Sorted discriminability for each feature dimension selected by DFT and validation and test accuracies on the UD dataset. A lower cross-entropy value indicates a more discriminant feature.

Table 3.6: POS tagging accuracy using DFT on the UD test set.
Word Embed.   Features     Dimension   Accuracy
FastText      Before DFT   1992        94.80
              After DFT    500         94.94
BERT          Before DFT   733         96.72
              After DFT    700         96.77

3.3.5 Effect of Parameters in XGBoost

We studied the impact of two important parameters of the XGBoost classifier: the maximum depth and the tree number. Figure 3.5 illustrates the performance of GWPT on the UD test dataset (top) and the model size for different maximum tree depths and tree numbers (bottom). Although GWPT's performance improves as the maximum tree depth and the tree number increase, the model size grows greatly once the maximum tree depth is larger than 2 and the tree number is greater than 2,000, while the improvement in accuracy is marginal. For this reason, we set the maximum depth to 3 and the tree number to 4,000 in order to strike a balance between performance and model size/complexity.

Figure 3.5: The effect of the maximum depth and the tree number in XGBoost on GWPT for the UD test set: POS tagging accuracy (top) and the model size (bottom).

3.4 Conclusion and Future Work

A novel lightweight word-embedding-based POS tagger, called GWPT, was proposed in this work. GWPT was designed with a modular structure. It analyzed word embedding frequencies, employed adaptive N-grams based on frequency intervals, selected discriminative features, and adopted the XGBoost classifier. It offered competitive POS tagging performance with few parameters and much lower inference complexity. In future work, we can exploit character embeddings to boost the performance further. Additionally, the XGBoost classifier is not efficient in handling multi-class classification problems since its model size increases rapidly. It would be interesting to design more efficient and lightweight classifiers for GWPT.

Chapter 4
Syntax-Aware Word Embedding

4.1 Introduction

A word is represented by a real-valued vector through word embedding techniques. The technique finds applications in natural language processing (NLP) tasks such as text classification, semantic search, parsing, and machine translation [33, 51, 141, 250, 304].
Although contextualized word embedding methods [57, 177, 249] show superior performance, static word embedding methods still play an important role in many application scenarios because of their simplicity [124, 151]. Static word embedding methods can be categorized into count-based and prediction-based types. Positive point-wise mutual information (PPMI) matrix factorization method [124] is a count-based model, while word2vec [151] is a prediction-based model. GloVe is a hybrid model consisting of both [175]. Although these two models have different model structures, both learn word embedding from the co-occurrence information of words and their contexts. Context selection is one of the important research topics in word representation learning. Most word embedding methods adopt linear contexts solely based on the distance. For a target word, its surrounding words are chosen as its contexts. The context importance is inferred based on the distance to the target word. The farther the distance, the less the importance. An alternative is the dependency-based context. For each sentence, a syntactic dependency parse tree can be generated by a dependency parser. The neighbors of a target word in the tree are chosen as its contexts. Dependency-based contexts have 60 Figure 4.1: The dependency parsing tree for an example sentence: He found a skinny and fragile dog in his backyard. been studied in both count-based [173] and prediction-based methods [123]. Compared with linear contexts, dependency-based contexts can find long-range contexts and exclude less informative contexts even though they are closer to the target word. One example is shown in Fig. 4.1, where the target word is ‘found’. Guided by the dependency parsing tree, its closely related words, i.e., ‘he’ and ‘dog’, which are the subject and object, respectively, can be easily identified. By contrast, less related words, e.g.,‘skinny’ and ‘fragile’, which are the modifiers of ‘dog’, are gathered by linear contexts. Most work on dependency-based word embedding [112, 123] adopts word2vec’s skip-gram model. They use a target word to predict its contexts constructed by dependency parsing. After parsing, triples are obtained from the tree, each of which contains a head word, a dependent word, and the dependency relationship between them. For a target word, the concatenation of its head or dependent words and their corresponding dependency relation forms dependency contexts (e.g., the contexts of ‘found’ are he/nsubj, dog/obj, and backyard/obl in Fig. 4.1). Since the dependency relation generated by a dependency parser 61 captures the syntactic information of a word in a sentence, relevant contexts can be extracted and exploited using the dependency relation. However, most dependency-based word embedding methods treat all context equally. An important application of word embedding is text classification, which assigns class labels to texts. One way for text classification is to apply a classifier to text features which are computed from word embeddings. However, one issue in existing word embedding methods is that they only consider the contextual information but do not take the task-specific information into account. The task-specific information can be valuable for performance improvement. It is worthwhile to mention that some word embedding methods were proposed for text classification with improved performance by including the topical information [142] or the syntactic information [112]. 
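To ground the discussion, the sketch below shows one way to extract dependency-based contexts with the Stanza parser and then learn embeddings with the PPMI-plus-SVD framework reviewed in the next section. It is a simplified illustration, not the implementation proposed in this chapter; the pipeline configuration, the word/relation context format, and the use of scikit-learn's TruncatedSVD are assumptions of this sketch.

import stanza

# English pipeline with dependency parsing (run stanza.download("en") once beforehand).
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse", verbose=False)

def dependency_contexts(text):
    # Yield (target, context) pairs, where a context is "word/deprel" of a head
    # or dependent directly connected to the target in the dependency parse tree.
    for sent in nlp(text).sentences:
        words = sent.words                    # word.id is 1-based; head == 0 marks the root
        for w in words:
            if w.head > 0:
                head = words[w.head - 1]
                yield head.text.lower(), f"{w.text.lower()}/{w.deprel}"   # e.g., found -> he/nsubj
                yield w.text.lower(), f"{head.text.lower()}/{w.deprel}"

The resulting pairs can populate a word-context count matrix, which is then converted to PPMI and factorized by truncated SVD, with the rows of UΣ serving as word embeddings:

import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

def ppmi_svd_embeddings(pairs, dim=300):
    counts = Counter(pairs)
    words = sorted({w for w, _ in counts})
    ctxs = sorted({c for _, c in counts})
    w_idx = {w: i for i, w in enumerate(words)}
    c_idx = {c: j for j, c in enumerate(ctxs)}
    rows, cols, vals = [], [], []
    for (w, c), n in counts.items():
        rows.append(w_idx[w]); cols.append(c_idx[c]); vals.append(n)
    M = csr_matrix((vals, (rows, cols)), shape=(len(words), len(ctxs)), dtype=np.float64)
    total = M.sum()
    w_sum = np.asarray(M.sum(axis=1)).ravel()   # N(w)
    c_sum = np.asarray(M.sum(axis=0)).ravel()   # N(c)
    M = M.tocoo()
    M.data = np.maximum(np.log(M.data * total / (w_sum[M.row] * c_sum[M.col])), 0.0)  # PPMI
    svd = TruncatedSVD(n_components=dim)        # assumes the vocabulary is larger than dim
    return words, svd.fit_transform(M.tocsr())  # rows of U * Sigma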
This work [268] introduces two dependency-based word embedding methods specifically designed for text classification tasks. The proposed methods utilize the PPMI matrix factorization framework and rely on the dependency parse tree to derive word contexts. The first method, dependency-based word embedding (DWE), selects keywords and neighboring words of a target word in the dependency parse tree to construct the word-context matrix. The second method, class-enhanced dependency-based word embedding (CEDWE), combines word-context and word-class co-occurrence statistics to improve classification accuracy. The effectiveness of both DWE and CEDWE is demonstrated through extensive experiments on popular text classification datasets using the logistic regression and XGBoost classifiers. The results show that these methods outperform several state-of-the-art word embedding techniques.

4.2 Methodology

We propose two new word embedding methods in this section. They are: DWE (Dependency-based Word Embedding) and CEDWE (Class-Enhanced Dependency-based Word Embedding). CEDWE is an enhanced version of DWE. Both use the PPMI matrix factorization method as the basic word embedding framework, which is briefly reviewed below.

Pointwise mutual information between a word-context pair (w, c) is defined as

PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ],   (4.1)

where P(w), P(c) and P(w, c) represent the probability of word w, the probability of context c, and the joint probability of word w and context c, respectively. The PMI can be estimated by

PMI(w, c) = log [ N(w, c) · |N| / (N(w) · N(c)) ],   (4.2)

where N(w), N(c) and N(w, c) represent the number of times word w, context c, and word-context pair (w, c) occur in a corpus, respectively. |N| is the total number of times all possible word-context pairs (w, c) occur. The PPMI matrix factorization method first counts the occurrences of words and contexts, and the co-occurrences of word-context pairs in the training corpus to estimate the PMI matrix. Then, the PPMI matrix X is built by replacing all negative elements in the PMI matrix with 0, i.e.,

PPMI(w, c) = max(PMI(w, c), 0).   (4.3)

Matrix X can be factorized with singular value decomposition (SVD) in the form of

X = U Σ V^T,   (4.4)

and the lower-dimensional matrix, UΣ, is adopted as the learned word embedding representation. By following this framework, we study the use of dependency parsing to construct the word-context matrix. Then, we show how to enhance word embedding by exploiting word distributions in different classes.

Table 4.1: Categories of universal syntactic relations.

Figure 4.2: Two example sentences and their corresponding dependency parse trees. The keywords spotted by our method are marked in red.

4.2.1 DWE Method

Existing dependency-based word embedding methods usually model the context by concatenating the dependency relation and the word connected by this relation [112, 123, 128]. Yet, this leads to a much larger vocabulary size and memory footprint when performing matrix factorization. As a result, the solution is only suitable for small-sized corpora and cannot be extended to larger ones. By contrast, we utilize dependency parsing to collect related words for a target word and use them as contexts in constructing the word-context co-occurrence matrix. The dependency relation can be dropped to avoid the above-mentioned problems. Our DWE method chooses two types of words in a dependency parse tree as contexts. They are: • Neighbor Words The dependency parse tree for each sentence can be viewed as a graph.
As compared to linear contexts where contexts are surrounding words of the target word, the dependency parse tree can offer informative contexts from words at a farther distance. We collect the words in the n-hop neighborhood as contexts, where n is a hyper-parameter. The co-occurrence counts of words and contexts are weighted by their distance. • Keywords The main meaning of one sentence can generally be expressed by several keywords in the sentence, such as subject, predicate, and object. Other words have fewer impacts, such as function words. Keywords carry the most important information of a sentence. Generally speaking, constructing contexts using keywords and paying more attention to them can provide more informative and robust contexts. We use the dependency relation to locate the keywords in a sentence. Through dependency parsing, each word is associated with its own dependency relation. Each relation represents a syntactic relation of a dependent against its head word. It also stands for the dependent word’s syntactic function in the sentence. Dependency relations have been classified by linguists into different categories based on their function, as shown in Table 4.1 [52]. We first exclude all stop words and punctuations in the dependency parse tree. 65 Then, we locate dependent words whose dependency relations are in the core arguments (or the root) and use them as keywords in the sentence. Examples are shown in Fig. 4.2, where keywords found by our method are marked in red. The main parts are captured by the keywords, like “semiconductor", “starts", “shipping", “chips" and so forth in the first sentence, and “researchers", “plan", “build" and “devices" in the second sentence. Other words provide supplementary information to make the sentence complete, and they can be discarded without harming the main meaning of the sentence. For example, in the second sentence, “100 tb tape storage" give us more details about the “devices". Nevertheless, other words still contain information toward the sentence meaning. So they are used as contexts if they are in neighboring hops. Besides the words in neighboring hops, keywords are always chosen as part of contexts. All context words are weighted by the distance (in terms of the number of hops) from the target word. Counting co-occurrence with the keyword contexts provides a simple context weighting scheme. Note that some keywords may play a dual role. That is, they may be neighbor words within n-hops as well. If this occurs, we double the count of the word-keywords context so as to increase its importance. 4.2.2 CEDWE Method The distribution of words may vary in different classes in text classification tasks. We define class-specific words as the words that typically occur in specific classes. Words that are not class-specific are more evenly distributed across all classes. Thus, class-specific words can be identified by measuring the difference between the word distributions and the uniform distribution. Class-specific words are one of the prominent features in text classification. To exploit class-specific words, our idea is to build a word-class co-occurrence matrix and use its row vectors as word features for classification. It is worthwhile to point out that, when we learn word embedding with the word-contexts co-occurrence statistics, the class information is incorporated in learned word embedding implicitly. This is because contexts, which are essentially words, are distributed differently in each class. 
However, it is still possible to 66 improve the classification accuracy furthermore to use the word class distribution explicitly to enhance the word embedding quality. This leads to the proposed class-enhanced dependency-based word embedding (CEDWE) method. It is detailed below. To inject the class information into the word-context matrix, we modify the raw word-context PPMI matrix, which is constructed by the whole dataset, using the word distribution in each class. Generally speaking, we compute the probabilities of words in each class and use them to extend the row vectors of the PPMI matrix. Mathematically, X denotes the raw word-context co-occurrence matrix. There are c classes of texts. The probability of word i in each class is ⃗pi = [pi1, pi2, · · · pic] where pij represents the probability that word i occurs in the j-th class. For word i, we first multiply its row vector X⃗ i in the PPMI matrix with its probability in each class and then concatenate the outputs for all classes to get an extended row vector in the form of X⃗ ′ i = [pi1X⃗ i , pi2X⃗ i , · · · , picX⃗ i ], (4.5) which is the row vector of the extended PPMI matrix X′ . Then, we apply SVD to the new matrix X′ for dimensionality reduction and get the learned word embeddings. After modifying the raw word-context PPMI matrix by the class information, the new PPMI matrix contains both the word-context information and the word-class information. The embedding of words that appear frequently in the same classes are closer in the embedding space. On the other hand, the embeddings for words that do not appear frequently in the same classes are pulled away even if they have similar contexts. Then, the learned word embeddings are more suitable for classification tasks. 4.3 Experiments In this section, we conduct experiments to show the effectiveness of the proposed DWE and CEDWE word embedding methods and benchmark them with several other popular word embedding methods. For DWE 67 and CEDWE, texts are parsed using the Stanza package [181]. Generally, the classification performance of DWE and CEDWE increases as the hop number becomes larger at the cost of higher complexity. Since the performance improvement is limited after 3 hops, we search contexts inside the 3-hop neighborhood to balance the performance/complexity trade-off. Table 4.2: Test accuracy comparison of several word embedding methods with the Logistic Regression classifier, where the best and the second-best results are displayed in boldface and with underscore, respectively. †: word embeddings are pre-trained on large corpora; ∗: word embeddings are trained on the text classification datasets. Word Embedding AG_NEWS DBpedia YahooAnswers Yelp.P Yelp.F Amazon.P Amazon.F word2vec† 89.08 96.63 68.21 88.77 52.95 84.30 46.87 GloVe† 89.64 96.85 67.78 86.79 51.10 82.38 44.76 EXT† 89.49 97.30 68.16 87.29 51.64 82.94 45.57 word2vec∗ 89.04 96.86 68.71 91.84 57.03 88.39 51.36 GloVe∗ 88.11 97.13 68.46 91.49 56.34 87.97 50.32 EXT∗ 89.57 97.31 68.55 91.45 56.36 86.95 49.65 ToWE∗ 88.53 95.40 69.57 91.60 56.52 88.16 50.48 PPMI/LC∗ 89.87 97.33 69.22 91.19 55.12 87.13 48.55 DWE∗ (Ours) 90.10 97.40 69.34 91.30 55.23 87.47 48.58 CEDWE∗ (Ours) 90.86 97.80 70.95 92.94 57.63 89.68 52.48 4.3.1 Datasets, Experiment Setup, and Benchmarks We adopt several large-scale text classification datasets from [291] to train our word embedding methods and conduct performance evaluation. Whenever possible, all results are obtained by averaging the results of 10 trials. 
For all dataset, the most 50K frequent tokens with more than 10 occurrences are selected as the vocabulary and the punctuation and stops words are excluded. • AG_NEWS. AG_NEWS is a 4-topic dataset extracted from AG’s corpus. Each topic has 30K training samples and 1.9K test samples. • DBpedia. DBpedia is a project aiming to extract structured content from the information in Wikipedia. The DBpedia text classification dataset is constructed using 14 topics from DBpedia. Each topic has 40K training samples and 5K test samples, where each sample contains the title and abstract of an article. 68 Table 4.3: Test accuracy comparison of several word embedding methods with the XGBoost classifier, where the best and the second-best results are displayed in boldface and with underscore, respectively. †: word embeddings are pre-trained on large corpora; ∗: word embeddings are trained on the text classification datasets Word Embedding AG_NEWS DBpedia YahooAnswers Yelp.P Yelp.F Amazon.P Amazon.F word2vec† 89.71 96.60 68.37 88.23 51.77 84.46 46.25 GloVe† 90.63 96.87 68.57 86.67 49.89 82.77 44.54 EXT† 89.89 97.12 68.31 86.66 50.31 82.82 44.74 word2vec∗ 90.29 96.77 68.97 91.07 55.20 87.84 49.75 GloVe∗ 89.82 97.16 69.09 90.92 55.02 87.62 49.28 EXT∗ 90.67 97.05 68.82 90.55 54.66 86.35 48.18 ToWE∗ 90.54 96.10 70.35 91.45 55.88 88.19 50.21 PPMI/LC∗ 90.65 97.36 69.86 89.87 53.77 86.55 47.99 DWE∗ (Ours) 90.87 97.45 69.95 90.11 54.22 86.78 48.06 CEDWE∗ (Ours) 91.75 97.88 71.80 92.19 57.03 89.28 51.85 Table 4.4: Classification accuracy results and the number of word-context sample pairs (in the unit of million) for the dependency-based contexts, where “DWE w/o K" means the proposed DWE method without the use of the keyword context. AG DBpedia Y.A. Yelp.P Yelp.F A.P A.F 3-hop DWE w/o K Accuracy 90.62 97.38 69.79 90.01 54.00 86.76 48.08 (sample pairs) (25.9M) (130M) (427M) (254M) (302M) (930M) (872M) 3-hop DWE Accuracy 90.87 97.45 69.95 90.11 54.22 86.78 48.06 (sample pairs) (33.9M) (141M) (516M) (293M) (348M) (1077M) (1011M) 5-hop DWE w/o K Accuracy 90.80 97.47 70.03 90.21 54.23 86.80 48.11 (sample pairs) (47.4M) (200M) (660M) (372M) (442M) (1348M) (1269M) • YahooAnswers. YahooAnswers is a 10-topic classification dataset extracted from the Yahoo! Webscope program. Each topic has 140K training samples and 5K test samples. Each sample has a question and its answer. • YelpReviewPolarity & YelpReviewFull. Yelp review is a sentiment classification dataset extracted from the 2015 Yelp Dataset Challenge. It has two sub-datasets: YelpReviewPolarity and YelpReviewFull. YelpReviewFull has 5 classes ranging from stars 1 to 5, where each class has 130K training samples and 10K test samples. In YelpReviewPolarity, stars 1 and 2 are treated as negative while stars 4 and 5 are viewed as positive. Each class has 280K training samples and 19K test samples. 69 • AmazonReviewPolarity & AmazonReviewFull. Amazon review is also a sentiment classification dataset built upon Amazon customer reviews and star rating. It has two sub-datasets: AmazonReviewPolarity and AmazonReviewFull. In AmazonReviewPolarity, each class has 600K training samples and 130K test samples. In AmazonReviewFull, each class has 1.8M training samples and 200K test samples. We compare the proposed DWE and CEDWE methods with eight benchmarking methods in Table 4.2 & 4.3. Among the eight benchmarking methods, three of them are pre-trained on large general corpora and five of them are trained on task-specific datasets. 
The pre-trained models include 1) word2vec [151], 2) GloVe [175], 3) Pre-trained Extended Dependency Skip-gram (EXT) [112]. The models trained on taskspecific datasets are: 1) word2vec, 2) GloVe, 3) EXT, 4) Task-oriented Word Embedding (ToWE) [141], 5) PPMI matrix with linear contexts (PPMI/LC). ToWE is based on word2vec and designed specifically for text classification. PPMI/LC is trained by PPMI matrix factorization with linear contexts. PPMI/LC is used to compare PPMI matrix factorization embedding methods with linear and dependency-based contexts. It can be viewed as a baseline of our model. The window size for word2vec, GloVe, ToWE, PPMI/LC and DWE is all set to ten for fair comparison. (a) AG_NEWS (b) DBpedia (c) YelpReviewPolarity Figure 4.3: The classification accuracy curves as a function of embedding dimensions for three datasets: (a) AG_NEWS, (b) DBpedia and (c) YelpReviewPolarity, where the tested dimensions are set to 50, 100, 200 and 300. We experiment with four different word embedding dimensions: 50, 100, 200, 300. A larger dimension often yields better performance in most datasets. We will show embedding dimension impacts on 70 Figure 4.4: The classification accuracy as a function of hop sizes for the AG_NEWS dataset, where the results are obtained by using only n-hop neighbor words in the dependency parse tree as contexts (namely, the keyword contexts are ignored), where n = 1, · · · , 6. performance later. For fair comparisons, we set the dimension of all word embedding methods to 300 if unspecified. For evaluation on text classification datasets, we leverage two classifiers for inference: logistic regression and XGBoost [36]. The averaged word embedding is used as the text representation, which serves as the input to the classifiers. There are more complex methods to obtain the text representation from word embeddings [247]. Here, since our goal is to test the quality of word embedding methods (rather than achieving state-of-the-art performance in text classification), the simplest method is utilized. 4.3.2 Results and Analysis Experimental results with the logistic regression classifier and the XGBoost classifier are shown in Tables 4.2 and 4.3, respectively. As compared with models trained on general corpora, models trained on task-specific corpus generally offer better performance. The improvement is obvious for the YelpReview and AmazonReview datasets. It is also worthwhile to point out that our DWE model outperforms PPMI/LC consistently across all datasets. After incorporating the class information in the word embedding learning process, CEDWE achieves the best performance. The performance improvement of DWE and CEDWE is primarily due to the design of a word embedding method to match its target task. The benefit of designing task-specific word embedding and training word embedding on task-specific datasets is clearly demonstrated by the experimental results. 71 (a) AG_NEWS: DWE (b) AG_NEWS: CEDWE Figure 4.5: Visualization of the embedding spaces of (a) DWE and (b) CEDWE for the AG_NEWS dataset. (a) YelpReviewPolarity: DWE (b) YelpReviewPolarity: CEDWE Figure 4.6: Visualization of the embedding spaces of (a) DWE and (b) CEDWE for the YelpReviewPolarity dataset. Effect of Embedding Dimension. Generally, word embeddings of a larger dimension have better classification performance. We show the classification accuracy curves as a function of word embedding dimensions with the XGBoost classifier in Fig. 4.3. 
Furthermore, we see a significant performance gap between CEDWE and other word embeddings when the dimension is lower. We can see our proposed method also performs well when the dimension is low. It indicates that the proposed CEDWE is a good choice when a lightweight model is essential in an application scenario. Linear vs. Dependency-based Contexts. We see from Tables 4.2 and 4.3 that there is a clear performance gap between the PPMI/LC and DWE. As compared to general word embeddings trained on large-scale corpora, word embedding methods trained for specific tasks usually have much less training texts. For such an environment, dependency-based contexts are more informative than linear contexts. The use of a syntactic dependency parser can make obtained contexts more robust and stable. 72 Effect of Keyword Contexts. It is observed in our experiments that the classification performance increases as the neighbor-hop size in dependency-based contexts (or the window size in linear contexts) increases. This is because more word-context pairs are collected with a larger hop number (or window size). Nevertheless, the improvement is diminishing when the hop size (or the window size) reaches a certain level as illustrated in Fig. 4.4, where the classification accuracy is plotted as a function of the hop size for the AG_NEWS dataset. Interestingly, we can leverage keywords and use them as extra contexts to allow a smaller hop size to reduce the computational complexity as shown in Table 4.4. We compare three ways to choose word contexts based on the dependency parse tree: 1. DWE with 3-hop neighbor contexts only; 2. DWE with 5-hop neighbor contexts only; 3. DWE with 3-hop neighbor contexts and keyword contexts; Since the default DWE has both neighbor and keyword contexts, we use the notation “DWE w/o K" (DWE without keyword contexts) to denote the first two cases. The classification accuracy results reported in the table are obtained using the XGBoost classifier. For AmazonReview dataset, the number of sample word-context pairs is already enough for hop size 3 and using more word-context pairs won’t increase the performance. For AG_NEWS, DBpedia, YahooAnswers, and YelpReviewFull datasets, we can see the effectiveness of using keywords as additional contexts. The performance of the second and third cases is close to each other while the number of sample pairs in the third case is significantly smaller than that in the second case. Effect of Explicit Class Information. Some words have similar contexts but appear in different classes in text classification. For example, adjectives in different classes can modify the same object in movie review datasets (e.g., “a nice movie”, “a funny movie”, “a disappointed movie”, “a terrible movie”). 73 General word embedding methods may have these adjectives closer since they have some similar contexts. This is, however, undesirable for classification tasks. The proposed CEDWE method takes the word class information into account in forming the word-context PPMI matrix to address this shortcoming. In the embedding space, the boundaries of class-specific words of different classes becomes clearer and words that frequently appear in the same class are pulled together. Task-specific words frequently appear in some specific classes so that they have higher occurrence probabilities in the corresponding classes. We use the chi-square test to select class-specific words and denote them with the class that has the highest occurrence probability. 
Then, t-SNE dimensionality reduction is used to visualize these task-specific words in the embedding space. The t-SNE plots of the embedding spaces of DWE and CEDWE for AG_NEWS and YelpReviewPolarity are shown in Figs. 4.5 and 4.6, respectively. As compared with DWE, class-specific words in different classes are better separated in CEDWE. This explains why CEDWE has better classification performance than DWE. 4.4 Conclusion and Future Work Two dependency-based word embedding methods, DWE and CEDWE, were proposed in this work. DWE uses keywords in sentences as extra contexts to build the word-context matrix. It provides informative contexts in a larger scope. As a result, compared with the scheme that only uses neighbor words as contexts, it achieves comparable text classification performance with less word-context sample pairs. To improve the text classification performance furthermore, CEDWE incorporates the word class distribution. The t-SNE plot visualization tool is also used to visualize the learned embedding, which can better illustrate the superiority of the proposed CEDWE model. As future extensions, it would be interesting to exploit more well-defined weighting function on contexts based on the dependency relation. It is also worthwhile to learn effective word embedding methods for intrinsic and extrinsic tasks that go beyond text classification. 74 Chapter 5 Word Mover’s Distance Computation and Its Application 5.1 Introduction Sentence similarity evaluation has a wide range of applications in natural language processing, such as semantic similarity computation [169], text generation evaluation [290, 297], and information retrieval [8, 262]. Methods for sentence similarity evaluation can be categorized into two main classes: 1) sentenceembedding-based methods and 2) word-alignment-based methods. The former finds vector representations of sentences and calculates the similarity of two sentences by applying a distance measure such as the cosine or the l2 distance. The latter operates at the word level and uses the alignment cost of corresponding words in two sentences as the sentence similarity measure. As one of the word-alignment-based methods, Word Mover’s Distance (WMD) [117] formulates text similarity evaluation as a minimum-cost flow problem. It finds the most efficient way to align the information between text sequences through a flow network defined by word-level similarities. By assigning flows to individual words, WMD computes text dissimilarity as the minimum cost of moving words’ flows from one sentence to another based on pre-trained word embeddings. WMD is interpretable as text dissimilarity is calculated as the distance between words in two text sequences. However, a naive WMD method does not perform well on sentence similarity evaluation for several reasons. First, WMD assigns word flow based on word frequency in a sentence. This frequency-based 75 Figure 5.1: The structure of a dependency parsing tree for an exemplary sentence: “He found a skinny and fragile dog in his backyard.” word weighting scheme is weak in capturing word importance when considering the statistics of the whole corpus. Second, the distance between words solely depends on the embedding of isolated words without considering the contextual information of words and the structural information of input sentences. 
Since the meaning of a sentence depends on individual words as well as their interaction, simply considering the alignment between individual words is deficient in evaluating sentence similarity. In this work, we propose an enhanced WMD method called the Syntax-aware Word Mover’s Distance (SynWMD). It exploits the structural information of sentences to improve the naive WMD for sentence similarity evaluation. A syntactic parse tree represents a sentence using a tree structure. It encodes the syntax information of words and the structural information of a sentence. The dependency parse tree (see an example in Fig. 5.1) is one type of the syntactic parse tree. Each node in the tree represents a word, and an edge represents the dependency relation of two connected words. Thus, words’ related contexts can be well captured by the structures of the dependency parse tree. For example, dog in Fig. 5.1 is one of the most related contexts of found as its objective. Such a relationship can be easily spotted by the dependency parse tree. In contrast, skinny and fragile are not directly related to found because they are the modifiers of dog. They are far away from found in the dependency parse tree although they are close to found in the sequential order. The dependency parse tree provides valuable information in semantic modeling and is proven useful in various NLP applications, such as word embedding [123, 269], semantic role labeling [222], machine translation [164], and text similarity tasks [184, 263]. 76 In this paper, we present SynWMD, a novel approach to improve the performance of sentence similarity evaluation by incorporating the dependency parse tree technique in both word flow assignment and word distance modeling. Firstly, we propose a new syntax-aware word flow calculation method, which represents words as a weighted graph based on co-occurrence statistics obtained from dependency parsing trees. A PageRank-based algorithm is then employed to infer word importance. Secondly, we enhance the word distance model in WMD by leveraging the contextual information extracted from dependency parse trees. Specifically, SynWMD models the contextual information of words and the structural information of sentences as additional subtree embeddings. Finally, we conduct extensive experiments on semantic textual similarity tasks, sentence classification tasks, and sentence re-ranking tasks to evaluate the effectiveness of SynWMD. Our experimental results demonstrate that SynWMD outperforms state-of-the-art sentence similarity models. The code for SynWMD is available at https://github.com/amao0o0/SynWMD. The rest of this chapter is organized as follows. SynWMD is proposed in Sec. 5.2. Experimental results are shown in Sec. 5.3. Finally, concluding remarks are given in Sec. 5.4. 5.2 Methodology In this section, we present a review of WMD and, then, introduce two syntax-aware components of SynWMD; namely, Syntax-aware Word Flow (SWF) and Syntax-aware Word Distance (SWD). 5.2.1 Word Mover’s Distance Inspired by the Wasserstein metric, WMD measures text similarity using the optimal transport distance. It first utilizes pre-trained word embeddings to compute the distance between words in two text sequences. Let xi be the embedding of word i. WMD defines the distance between word i and word j as c(i, j) = ||xi − xj ||2, which is also referred to as the transport cost from word i to word j. Next, WMD assigns a 77 flow to each word. 
The amount of flow f_i of word i is defined as the normalized word occurrence rate in a single text sequence:

f_i = count(i) / |f|,   |f| = Σ_i count(i),   (5.1)

where |f| is the total word count of a text sequence. Then, WMD measures the dissimilarity of two texts as the minimum cumulative transport cost of moving the flow of all words from one text sequence to the other. It can be formulated as the following constrained optimization problem:

min_{T_ij ≥ 0}  Σ_{i∈I} Σ_{j∈J} T_ij c(i, j)   (5.2)

subject to:

Σ_{j∈J} T_ij = f_i  ∀i ∈ I,   and   Σ_{i∈I} T_ij = f'_j  ∀j ∈ J,   (5.3)

where I and J are the sets of words in the two text sequences, respectively, and T_ij represents the amount of flow that travels from word i to word j, which is a variable to be determined. The above constrained optimization problem can be solved by linear programming.

WMD has two main shortcomings. First, important words in a sentence should be assigned a higher flow based on Eq. (5.2). Yet, WMD assigns word flow according to the word occurrence rate in a sentence. This simple scheme cannot capture word importance in a sentence accurately. Second, the transport cost between two words is solely decided by their word embeddings. Nevertheless, the meaning of a word may be affected by its context, and the meaning of a sentence can be affected by the structure of word combinations. It is desired to develop more effective schemes. They are elaborated in Secs. 5.2.2 and 5.2.3.

5.2.2 Syntax-aware Word Flow

Important words in two sentences can largely decide their similarity. As given in Eq. (5.2), a word with a higher flow has a greater impact on WMD results. Thus, a more important word in a text sequence should be assigned a higher flow. We propose an enhanced word flow assignment scheme called syntax-aware word flow (SWF). Given a dataset, we compute the word co-occurrence frequency in dependency parse trees using existing parsers [181] and obtain word importance based on the co-occurrence statistics for flow assignment. The computation of SWF is detailed below.

1. Parse all sentences in a dataset and count the co-occurrence of two words if they appear in a parse tree within n hops. The co-occurrence count is further weighted by the distance between the two words in the parse tree; namely, it is divided by the hop number between the two words.

2. Build a weighted graph for the dataset, where each node corresponds to a word and the edge between two connected nodes is weighted by their co-occurrence count as computed in Step 1. When words are associated with higher edge weights, they co-occur with other words in the dataset more frequently. These words are viewed as less important in sentences as they are more predictable. Under this assumption, a word with a higher total edge weight should be assigned a lower word flow.

3. Use the weighted PageRank algorithm [174] to aggregate all edge weights of a node, which gives a rough estimate of node importance, and assign the inverse of the PageRank value as its word flow.

The last step can be written as

PR(i) = (1 − d) + d · Σ_j [ w_ij PR(j) / Σ_k w_jk ],   (5.4)

f_i = 1 / PR(i),   (5.5)

where w_ij is the edge weight between word i and word j, PR(i) is the PageRank value of word i, which is iteratively calculated as the weighted sum of the PageRank values of its neighboring words, and d is a parameter used to control the smoothness of word flow. In this way, SWF assigns a lower flow to a word that co-occurs with other words more frequently in the parse trees.
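The sketch below re-implements Eqs. (5.4)-(5.5) with a plain weighted-PageRank iteration and solves the transport problem of Eqs. (5.2)-(5.3) with SciPy's linear-programming solver. It is an illustrative re-implementation under stated assumptions (dense matrices, a fixed iteration count, and pre-computed co-occurrence weights and transport costs), not the released SynWMD code.

import numpy as np
from scipy.optimize import linprog

def swf_scores(weights, d=0.2, iters=50):
    # weights[i, j]: distance-weighted co-occurrence count of vocabulary words i and j
    # collected from the corpus-level dependency parse trees (Steps 1-2 above).
    out_sum = weights.sum(axis=1)
    out_sum[out_sum == 0] = 1.0                        # guard against isolated words
    pr = np.ones(weights.shape[0])
    for _ in range(iters):
        pr = (1 - d) + d * weights.dot(pr / out_sum)   # Eq. (5.4)
    return 1.0 / pr                                    # Eq. (5.5), before per-sentence normalization

def sentence_flow(word_ids, scores):
    # Flow vector of one sentence: inverse-PageRank values of its words, normalized to sum to 1.
    f = scores[np.asarray(word_ids)]
    return f / f.sum()

def wmd(flow_a, flow_b, cost):
    # Minimum cumulative transport cost (Eqs. (5.2)-(5.3)) via linear programming.
    n, m = cost.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0               # row sums equal flow_a
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                        # column sums equal flow_b
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([flow_a, flow_b]),
                  bounds=(0, None), method="highs")
    return res.fun

In SynWMD, the cost matrix passed to this solver is the syntax-aware word distance of Eq. (5.6) introduced in the next subsection.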
Figure 5.2: Illustration of the shortcoming of the distance calculation in WMD and the improved SWD solution. The distance between words in SWD is decided by word embeddings and subtree embeddings.

Figure 5.3: 1- and 2-hop subtrees with "open" as the parent node in the dependency parse tree for the sentence: "I am not sure if you can open a bank account in France". Stopwords are ignored in the figure.

5.2.3 Syntax-aware Word Distance

In WMD, the distance between words is called the transport cost. It is computed using static word embeddings without considering any contextual information of words or structural information of the sentence. The example given in Fig. 5.2 illustrates this shortcoming. The two identical words "bank" in the two sentences have a distance of zero in WMD. However, they do not have the same meaning because of their different contexts. We can exploit contextual word embeddings, such as BERT-based models, to alleviate this problem. Here, we propose a syntax-aware word distance (SWD). Simply speaking, SWD uses the dependency parse tree to find the most related contexts of words and incorporates this information in the word distance calculation. SWD can be applied to both static and contextual word embeddings to improve the performance of WMD. The procedure of SWD is detailed below.

1. Generate candidate subtrees from a dependency parse tree. For each word in a tree, we treat it as a parent node, and use itself and its connections to m-hop children to form the subtrees, where m is a hyper-parameter. With children from different hops, contextual information from multiple levels can be extracted by the subtrees. Fig. 5.3 shows 1-hop and 2-hop subtrees where the word "open" is the parent node.

2. For each word, collect all subtrees that contain it as its context. Then, obtain each subtree embedding as the weighted average of all its word embeddings.

3. Incorporate the context of the target word in the word distance calculation. As shown in Fig. 5.2, besides distances between word embeddings, distances between subtree embeddings are also incorporated.

For the last step, the syntax-aware word distance between words i and j can be computed by

c(i, j) = dist(v_i, v_j) + a · [ Σ_{s_i∈S_i} Σ_{s_j∈S_j} dist(s_i, s_j) ] / ( |S_i| · |S_j| ),   (5.6)

where S_i and S_j are the sets of subtrees that contain words i and j, respectively, and a is a parameter controlling the amount of contextual and structural information to be incorporated. The cosine distance,

dist(v_i, v_j) = 1 − ⟨v_i, v_j⟩ / ( ||v_i|| · ||v_j|| ),   (5.7)

is used to measure the distance between word embeddings and between subtree embeddings.

5.3 Experiments

We evaluate SynWMD on eleven datasets, which include six semantic textual similarity datasets, four sentence classification datasets with the k-nearest neighbor classifier, and one sentence re-ranking dataset. In all experiments, the dependency parse trees are obtained using the Stanza package [181].

5.3.1 Semantic Textual Similarity

Table 5.1: Spearman's (ρ × 100) correlation comparison of unsupervised methods, where the best results of each word embedding are displayed in boldface. The numbers in brackets show the performance gain or loss of our methods as compared with WMDcos+IDF. Results of [†] are taken from [70]. Embeddings Methods STS12 STS13 STS14 STS15 STS16 STS-B Avg. word2vec(avg.) Sent. Emb.
55.28 70.09 65.53 75.29 68.73 65.17 66.68 BERT(first-last avg.)† 39.70 59.38 49.67 66.03 66.19 53.87 55.81 BERT-flow† 58.40 67.10 60.85 75.16 71.22 68.66 66.90 BERT-whitening† 57.83 66.90 60.90 75.08 71.31 68.24 66.71 CT-BERT† 61.63 76.80 68.47 77.50 76.48 74.31 72.53 SimCSE-BERT† 68.40 82.41 74.38 80.91 78.56 76.85 76.92 word2vec WMDl2 58.12 58.78 60.16 71.52 66.56 63.65 63.13 WMDcos 54.82 61.42 60.71 72.67 66.90 62.49 63.30 WRD 56.72 64.74 63.44 75.99 69.06 65.26 65.87 WMDl2+IDF 60.36 67.01 63.06 72.41 68.30 65.91 66.18 WMDcos+IDF 57.64 69.25 63.81 73.50 68.83 65.51 66.61 SynWMDSW F 60.24 74.71 66.10 75.94 69.54 66.24 68.80 (↑2.19) SynWMDSW F +SW D 60.30 75.43 66.22 75.95 70.06 66.65 69.10 (↑2.49) BERT(first-last) WMDl2 53.03 58.96 56.79 72.11 63.56 61.01 60.91 WMDcos 55.38 58.51 56.93 72.81 64.47 61.80 61.65 WRD 49.93 63.48 57.63 72.04 64.11 61.92 61.52 BERTScore 61.32 73.00 66.52 78.47 73.43 71.77 70.75 WMDl2+IDF 61.19 68.67 63.72 76.87 70.16 69.56 68.36 WMDcos+IDF 63.79 69.25 64.51 77.58 71.70 70.69 69.59 SynWMDSW F 66.34 77.08 68.96 79.13 74.05 74.06 73.27 (↑3.68) SynWMDSW F +SW D 66.74 79.38 69.76 78.77 75.52 74.81 74.16 (↑4.57) SimCSE-BERT WMDl2 64.66 79.72 73.12 81.25 76.69 77.53 75.50 WMDcos 65.43 80.00 73.35 81.21 76.97 77.18 75.69 WRD 64.80 80.97 74.13 80.71 76.68 78.47 75.96 BERTScore 66.31 82.87 75.66 83.14 79.16 80.03 77.86 WMDl2+IDF 67.35 81.36 74.56 82.29 78.12 79.18 77.14 WMDcos+IDF 68.47 81.76 74.98 82.30 78.29 78.98 77.46 SynWMDSW F 70.20 83.36 76.17 83.16 78.81 80.02 78.62 (↑1.16) SynWMDSW F +SW D 70.27 83.44 76.19 83.21 78.83 79.98 78.66 (↑1.19) 82 Datasets. Semantic similarity tasks are widely used to evaluate sentence similarity assessment methods. Here, we consider six semantic textual similarity (STS) datasets, including STS2012-16 and STSBenchmark. Sentence pairs in STS are extracted from a wide range of domains such as news, web forum, and image captions. They are annotated with similarity scores by humans. Each STS dataset contains several subsets on different topics. Since it is likely to have data from different topics in real-world scenarios, we apply the “all setting” evaluation for STS2012-16 as mentioned in [70]. The similarity scores of the sentence pairs in different subsets are concatenated and the overall Spearman’s correlation is reported. Benchmarking Methods. We choose the following benchmarking methods. • Sentence-embedding-based methods: 1) average methods: the average of word2vec embedding [151] and the average of the first and last layers of BERT [57], 2) post-processing methods: BERT-flow [127] and BERT-whitening [223], 3) contrastive learning methods: CT-BERT [30] and SimCSE-BERT [70]. • Word-alignment-based methods: original WMD, Word Rotator’s Distance [287], BERTScore [290], and WMD with IDF weights as the baselines. For exhaustive comparison, WMD using the l2 and cosine distance are both reported. Both non-contextual and contextual word embeddings are chosen as backbone models. They are word2vec, pre-trained BERT, and SimCSE. Experimental Setup. In the implementation of SWF, we count word co-occurrence in dependency parse trees if they are within 3 hops, and set the smooth term d = 0.2. In the implementation of SWD, we create subtrees with child nodes of no more than 3 hops. We set a = 0.2 for word2vec and SimCSE word embeddings and a = 1.0 for BERT word embedding. Isotropic Processing. It is observed in [63] that the average cosine similarity between randomly sampled words with pre-trained contextual word embedding is high. 
This implies that pre-trained contextual word embeddings are confined to a cone space, and they are not isotropic. It is also shown in [70, 248] that 83 the anisotropic property of pre-trained contextual word embedding hurts its performance in sentence similarity tasks severely. Post-processing methods (e.g., whitening) make BERT embedding less anisotropic in the embedding space and improve the performance in semantic similarity tasks. Thus, when BERT embeddings are used in the experiments, we perform the whitening operation on the word level for all word-alignment-based methods. Results. We compare a wide range of methods on 6 STS datasets and report their Spearman’s correlation results in Table 5.1. For all word embeddings, WMD and WMD+IDF perform better with the cosine distance than the l2 distance. This indicates that the cosine distance is a better metric for STS datasets. Furthermore, the word flow assignment with the IDF weight can enhance the performance of WMD. As to our proposed method, SynWMD+SWF outperforms other alignment-based methods by a substantial margin. SynWMD+SWF+SWD can improve SynWMD’s performance even more. This is especially obvious for word2vec and BERT embeddings. Under the same word embedding, SynWMD always outperforms sentence embedding methods, including the state-of-the-art unsupervised method, SimCSE. (a) word2vec (b) BERT (c) SimCSE-BERT Figure 5.4: The average Spearman’s correlation curves on STS datasets as a function of hop sizes or window sizes for three word embeddings: (a) word2vec, (b) BERT and (c) SimCSE-BERT. 5.3.2 Further Analysis on STS We perform an ablation study on SynWMD to offer a better understanding of its working principle in this subsection. Effect of the hop size. We study the sensitivity of hop sizes, n, in collecting word co-occurrence statistics in SWF. The blue curves in Fig. 5.4 show the average performance trend with different hop 84 sizes on STS datasets. We see from the figure that SynWMD+SWF with a larger hop size gives better performance. This is because more relationships between words are incorporated for a larger n. However, the performance gain saturates as n ≥ 3. Difference between parse tree and linear context in SWF. SWF collects co-occurrence statistics from dependency parse trees, which are well-organized structures of sentences. One can also use a sliding window to collect co-occurrence statistics from linear contexts and build the weighted graph. The differences between these two schemes are shown in Fig. 5.4. We see from the figure that the dependency parse tree in SWF outperforms the sliding window. This is because the dependency parse tree provides a powerful syntactic structure in collecting word co-occurrence statistics. Difference between subtree and n-grams in SWD. When collecting contextual information from words’ neighbors in SWD, one can replace subtrees with n-grams in Eq. (5.6). We study the difference between subtrees and n-grams with BERT embeddings. We generate 2-grams and 3-grams so that the number of n-grams has the same order of magnitude as subtrees’ in our experiments. All other experimental settings remain the same. The performance difference between subtree and n-grams is shown in Table 5.2. We can see from the table that the sentence structural information does perform better than n-gram features. Table 5.2: Comparison of Spearman’s (ρ × 100) correlation of using the subtree and the n-gram in SWD. 
Datasets n-gram subtree STS12 66.37 66.64 STS13 78.08 79.40 STS14 69.36 69.75 STS15 79.29 78.82 STS16 74.41 75.51 STS-B 74.67 74.93 Avg. 73.70 74.18 Effect of using different backbone word embedding models. As shown in Table 5.1, there is more performance improvement by applying SWD to word2vec and BERT word embeddings but less to SimCSE. 85 Figure 5.5: The averaged pairwise cosine distance of words in a sentence of STS datasets with three embeddings. Figure 5.6: Visualization of the word flow assigned by SWF, where weights are normalized. The higher the weight, the darker the color. One possible explanation for this phenomenon is that SimCSE word embeddings in a sentence tend to be similar. When words from a sentence have close embeddings, words and their subtrees are expected to have close embeddings. As a result, word distances keep a similar ratio even with the subtree distance, and results of the constrained optimization problem, i.e., Eq. (5.2), do not change much. To verify this point, we calculate the averaged pairwise cosine distance of words in a sentence with three word embeddings and show the results in Fig. 5.5. We see that BERT has the largest average distance while SimCSE has the smallest. This is consistent with their performance improvement. 86 (a) Without SWD (b) With SWD Figure 5.7: Visualization of the word distance between “bank” in sentence S1 and words in sentence S2. Sentence S1: “We camped near the bank of the river.” Sentence S2: “I am not sure if you can open a bank account in France.” The darker the color, the larger the distance. 5.3.3 Sentence Classification & Re-Ranking To further validate the effectiveness of SynWMD, we perform experiments on sentence classification and sentence re-ranking tasks. Datasets. For the k-nearest neighbor sentence classification task, we choose MR, CR, SST2, and SST5 sentence classification datasets from SentEval [44]. They are elaborated below. • MR: a movie review dataset where sentences are labeled as positive or negative sentiment polarities. • CR: a product review dataset with positive and negative sentence reviews. • SST2 & SST5: Both are movie review datasets. SST2 has two labels (positive and negative), while SST5 has five labels (very positive, positive, neutral, negative, and very negative). Note that WMD-based methods are not suitable for the k-nearest neighbor sentence classification with a large number of samples. For SST2 and SST5 datasets, only test samples are used and cross-validation is performed. They are denoted by SST2-test and SST5-test, respectively. For the re-ranking task, we choose the popular AskUbuntu dataset [14, 254], where 20 candidate questions are to be re-ranked based on the similarity with an input question. Only the test data is used for unsupervised WMD-based methods. 87 Table 5.3: Comparison of test accuracy for the k-nearest neighbor sentence classification. The best results of each dataset are displayed in boldface. Methods MR CR SST2-test SST5-test WMDl2 67.68 73.69 66.12 31.81 WMDcos 70.89 75.18 69.36 34.76 WRD 73.17 75.74 72.99 35.25 WMDl2+IDF 70.17 75.44 74.41 31.49 WMDcos+IDF 74.18 76.88 74.41 37.96 SynWMD 76.44 77.08 77.43 38.28 Table 5.4: Experimental results on the AskUbuntu dataset with four rank-based evaluation metrics: 1) Mean Average Precision (MAP), 2) Precision@1 (P@1), 3) Precision@5 (P@5), and 4) Mean Reciprocal Rank (MRR). The best results are displayed in boldface. 
Methods MAP p@1 p@5 MRR WMDl2 50.27 46.24 38.17 62.81 WMDcos 50.49 47.31 37.63 63.28 WRD 46.43 44.09 36.13 59.69 WMDl2+IDF 51.49 49.46 38.49 65.94 WMDcos+IDF 51.94 49.46 38.71 64.95 SynWMD 52.29 51.08 38.17 65.99 Benchmarking Methods. We compare SynWMD with 3 other WMD-based methods. They are: 1) original WMD, 2) Word Rotator’s Distance and 3) WMD with IDF weight. Results of WMD using the l2 and cosine distances are reported. Word2vec is used as the backbone word embedding model in this experiment. Experimental Setup. We set d = 0.1 and a = 0.1. All other settings remain the same as those in the STS tasks. The k value for the nearest neighbor classifier is chosen from 1 to 30 to achieve the best performance. Results: Experimental results are shown in Tables 5.3 and 5.4. The cosine distance is better than the l2 distance in all four sentence classification datasets. SynWMD outperforms other WMD-based methods by a large margin in the k-nearest neighbor sentence classification. SynWMD also outperforms other WMD-based methods on the AskUbuntu dataset on average. 88 5.3.4 Visualization of SynWMD The effectiveness of SWF and SWD is visualized in Figs. 5.6 and 5.7, respectively. First, we show the word flow assigned by SWF in Fig. 5.6. Since SWF examines word co-occurrence in the syntactic parsing tree, it assigns higher weights to important words (e.g., nouns) and lower weights to highly predictable words (e.g., prepositions). Second, we show the word distance (i.e., the transport cost) before and after using SWD in Fig. 5.7. For the example given in Fig. 5.7, the sentence distance is calculated by the cost of moving the flow of words between S1 and S2. We see that the distance between word “bank" in S1 and S2 becomes larger. This is because SWD considers the contextual information for each word. Thus, the measured similarity between two sentences is smaller with SWD. It is the desired result. The visualization given in Figs. 5.6 and 5.7 illustrate the working principle of SynWMD. 5.4 Conclusion and Future Work An improved Word Mover’s Distance (WMD) using the dependency parse tree, called SynWMD, was proposed in this work. SynWMD consists of two novel modules: syntax-aware word flow (SWD) and syntaxaware word distance (SWF). SWD examines the co-occurrence relationship between words in parse trees and assigns lower flow to words that co-occur with other words frequently. SWD is used to capture word importance in a sentence using the statistics of the whole corpus. SWF computes both the distance between individual words and their contexts collected by parse trees so that words’ contextual information and sentences’ structural information are incorporated. SynWMD achieves state-of-the-art performance in STS tasks and outperforms other WMD-based methods in sentence classification tasks. As future extensions, we may extend the idea beyond sentence-level by leveraging sentence embeddings and incorporate it into sentence-embedding-based methods to lower the computational cost [116]. 89 Chapter 6 Unsupervised Compressive Summarization 6.1 Introduction Text summarization condenses long document(s) into short and human-read versions while retaining the most important information. Due to text summarization’s ability to improve efficiency and save time in text processing by condensing large documents into a more concise format, it is important in many real-world fields beyond summarization itself. 
For example, in the news and media industry [81, 205], text summarization can be used to provide readers with a summary of complex articles, enabling them to quickly and efficiently understand the main points without having to read the entire text; in dialogue systems [252, 253], summarization can be used to provide users with concise responses to their queries, enabling more efficient communication; in information retrieval [158], summarization can help users to quickly identify the most relevant content within a large set of documents, improving the overall efficiency of the retrieval process. Text summarization can be divided into two primary approaches, namely extractive and abstractive text summarization. Extractive approaches generate summaries by selecting phrases or sentences from the source document(s). In contrast, abstractive approaches express the key information using different words or phrases that may not be present in the source document(s). Extractive approaches are generally considered to be easier to build and more faithful and factualconsistent, as they directly select phrases and sentences from the source documents [75, 89]. In addition, 90 as extractive methods inherit content directly from the source, they require less model complexity and have fewer training difficulties, as there is no need for an extra text generation stage. Extractive summarization methods [136, 172, 298, 299] are typically applied at the sentence level, where the most important sentences are identified and used to generate the summary. However, an alternative extractive approach is the compressive method, which extracts or prunes phrases within sentences to generate the summary. This approach offers several advantages over traditional sentence-level extraction, such as the ability to remove redundant or irrelevant sub-sentential content and generate more concise summaries. Furthermore, compressive summarization is particularly effective when working under tight length constraints, as it can help ensure that the summary fits within the desired length. Despite these advantages, most compressive summarization methods [56, 149, 276] require labeled data to learn specific rules for selecting or pruning phrases. Unfortunately, creating large, high-quality labeled datasets for summarization is challenging, particularly for different domains and languages. As a result, there is a need for new compressive methods and techniques that can learn to summarize text without relying on extensive labeled summarization datasets. This work aims to investigate the importance of compressive summarization and develop an unsupervised approach to this technique. First, we utilize Oracle algorithms to examine the impact of compressing the selected sentences on metrics such as ROUGE score, fluency, and compression ratio. The results indicate that further compression of selected sentences can significantly enhance the performance of extractive summarization models. However, in order to ensure that the resulting summaries maintain fluency, it is important to consider syntactic rules during the compression process. Then, by studying the relationships between sentences and words within the document using sentence and word embeddings, we design ranking functions that can identify and remove irrelevant sub-sentential units, resulting in more concise summaries. 
Our experimental results demonstrate that by adding a phrase extraction stage to traditional sentence-level extractive methods, our method can significantly improve their performance.

The rest of this chapter is organized as follows. The impact of further compression on selected sentences is studied in Sec. 6.2. Our unsupervised compressive summarization method is proposed in Sec. 6.3. Experimental results are shown in Sec. 6.4. Finally, concluding remarks and future work directions are given in Sec. 6.5.

6.2 Compressibility Study

Compressive text summarization further compresses selected sentences by deleting sub-sentential units. Compressibility refers to the potential for improving performance through this technique. An Oracle, which gives the upper bound of extractive methods, selects the parts of the source text that maximize the ROUGE score with respect to the reference summary. Compressibility can therefore be verified through the design of Oracles. The compressibility of recently developed text summarization datasets has been studied in [83, 149, 276, 301]. [276] forms binary decision labels by first applying a sentence-level extractive oracle and then a phrase-level oracle that checks whether the ROUGE score increases with or without each phrase. [149] builds a compressive oracle by training a supervised sentence compression model that decides whether to keep a span in a sentence. However, while most research has focused on the improvement of ROUGE scores for compressed summaries generated from the extraction of words or phrases, the effects of sub-sentential extraction on other aspects, such as fluency, naturalness, and compression ratio, have not been thoroughly investigated. In this section, we design Oracles to comprehensively study the potential for improvement brought by compressive summarization and to identify important factors to consider when extracting sub-sentential units.

6.2.1 Oracle Design

We propose three different settings of Oracles where the sum of ROUGE-1 and ROUGE-2 is maximized.
• Sentence Oracle: This Oracle greedily selects sentences. For each sample, the optimal number of sentences is determined based on the best-performing set of sentences.
• Sentence & N-gram Oracle: This Oracle first greedily selects sentences, which is the same as the Sentence Oracle. Then it greedily selects n-grams from the selected sentences to achieve the best ROUGE score.
• Syn-Cons. Oracle (Syntactic constraint Oracle): This Oracle first greedily selects sentences, and then greedily removes phrases from the selected sentences under syntactic compression rules to achieve the best ROUGE score. The syntactic compression rules ensure the grammaticality of the final outputs. We follow the rules proposed by [56], which rely on constituency parsing.

Oracle                  CNNDM (R-1 / R-2 / R-L)    XSUM (R-1 / R-2 / R-L)    PubMed (R-1 / R-2 / R-L)
Sentence Oracle         53.82 / 30.19 / 42.59      30.70 / 8.77 / 23.01      62.89 / 36.31 / 58.18
Sent & 1-gram Oracle    73.43 / 38.86 / 48.49      50.56 / 14.67 / 36.61     77.14 / 40.31 / 69.96
Sent & 3-gram Oracle    62.53 / 37.28 / 47.29      40.49 / 12.15 / 28.80     69.31 / 41.40 / 63.57
Syn-Cons Oracle         60.07 / 34.84 / 48.04      35.24 / 10.60 / 26.82     67.56 / 39.18 / 62.46
Table 6.1: ROUGE scores of Oracles on the CNN/DM, XSUM, and PubMed datasets

Oracle                  CNNDM (PPL / SLOR)    XSUM (PPL / SLOR)    PubMed (PPL / SLOR)
Reference               203.47 / 3.17         90.04 / 3.31         42.05 / 4.97
Sentence Oracle         106.05 / 3.40         99.06 / 2.92         45.44 / 4.74
Sent & 1-gram Oracle    646.57 / 1.70         2233.34 / -1.25      122.00 / 3.40
Sent & 3-gram Oracle    362.30 / 2.36         608.58 / 1.11        95.16 / 4.04
Syn-Cons Oracle         158.77 / 3.03         222.84 / 2.16        45.44 / 4.41
Table 6.2: Fluency measurement of Oracles.
A lower PPL and a higher SLOR indicate better fluency.

Oracle                  CNNDM (Avg. Len. / Comp. Ratio)    XSUM (Avg. Len. / Comp. Ratio)    PubMed (Avg. Len. / Comp. Ratio)
Sentence Oracle         72.54 / 14.5                       34.49 / 13.2                      229.91 / 10.2
Sent & 1-gram Oracle    33.37 / 6.7                        9.22 / 3.3                        143.52 / 6.4
Sent & 3-gram Oracle    37.33 / 7.5                        16.94 / 6.4                       150.82 / 6.8
Syn-Cons Oracle         54.10 / 10.8                       24.04 / 9.2                       177.53 / 8.0
Reference               54.68 / 11.2                       23.19 / 9.5                       208.028 / 9.8
Table 6.3: Summary average length and compression ratio (in words)

Oracle                  Exemplary sentences
Sentence Oracle         Members of the public told police the shot was fired at a dark coloured car by a white man in a grey hooded top who was on foot .
Sent & 1-gram Oracle    police shot was fired at a car a in .
Sent & 3-gram Oracle    public told police the shot was fired at a dark coloured car man in a foot .
Syn-Cons Oracle         Members told police the shot was fired at a car by a man in a hooded top .
Reference               A shot was reportedly fired at a car outside a primary school in Liverpool as parents were taking their children inside, police have said .
Table 6.4: Exemplary sentences of Oracles

The objective of our study is not to design Oracles that maximize the upper bound of the ROUGE score. Rather, we aim to comprehensively evaluate the performance of various Oracles based on multiple metrics, including the ROUGE score, the compression ratio, and fluency.

6.2.2 Datasets and Evaluation Metrics

We implement the compressibility study on the following datasets: 1) CNN/DM [81] is a large-scale text summarization dataset that contains 93k articles from CNN and 220k articles from Daily Mail newspapers. The articles in CNN/DM cover a wide range of topics, including politics, sports, entertainment, and technology. 2) XSUM [162] contains 226k articles published by the BBC. Each article is paired with a single summary that highlights the key points. Thus, a concise summary is particularly desired in this dataset. 3) PubMed [43] is a long-document dataset of 215k scientific publications from PubMed. The test sets of these datasets are used in this compressibility study.

Evaluation Metrics: First, the summary should retain important information and be coherent with the source text. We use the ROUGE score to measure the content overlap between extracted texts and reference texts. Second, the output summary should also be easy to read without misunderstanding, which we refer to as fluency (some papers also call it naturalness, grammaticality, or readability). In this study, we report two automatic fluency measurements: perplexity and the syntactic log-odds ratio (SLOR) [101]. Perplexity and SLOR are calculated by the following formulas:

PPL(S) = \exp\left(-\frac{1}{|S|}\ln p_M(S)\right),

SLOR(S) = \frac{1}{|S|}\left(\ln p_M(S) - \ln \prod_{t \in S} p(t)\right),

where p_M(S) is the probability of sentence S assigned by the LM and p(t) is the unconditional probability of unigram t. Perplexity is a widely used metric for language modeling and measures the degree of uncertainty in predicting the next word given the previous words. The lower the perplexity, the more fluent and natural the language of the summary. SLOR, on the other hand, is a metric proposed by [101] that takes into account the impact of unigram probabilities on fluency measurement. To avoid the influence of unigram probabilities on the fluency measure, SLOR subtracts the total unigram log-probability of the sentence. A higher SLOR value indicates greater fluency and naturalness in the language of the summary. The language model that we use to calculate perplexity and SLOR is GPT-2. Lastly, the average length and compression ratio of summaries are reported.
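To make the two fluency measures concrete, the following is a minimal sketch (not the exact evaluation script used in this study) of how PPL and SLOR could be computed with GPT-2 through the HuggingFace transformers library; the unigram log-probability table is assumed to be estimated separately from a reference corpus, and all function names here are illustrative.

```python
# A minimal sketch, assuming the HuggingFace transformers library; GPT-2 is the
# scoring LM as in this study, but the exact tokenization and unigram estimates
# used in this work may differ.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_log_prob(sentence: str):
    """Return (ln p_M(S), token ids) for a sentence S under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean cross-entropy over
        # the |S| - 1 shifted next-token predictions.
        loss = model(ids, labels=ids).loss.item()
    n_pred = ids.shape[1] - 1
    return -loss * n_pred, ids[0].tolist()

def ppl(sentence: str) -> float:
    # PPL(S) = exp(-(1/|S|) ln p_M(S))
    log_p, ids = sentence_log_prob(sentence)
    return math.exp(-log_p / len(ids))

def slor(sentence: str, unigram_log_prob: dict) -> float:
    # SLOR(S) = (ln p_M(S) - sum_t ln p(t)) / |S|
    # unigram_log_prob maps a token id to ln p(t), estimated from a corpus
    # (an assumption of this sketch; unseen tokens get a small floor value).
    log_p, ids = sentence_log_prob(sentence)
    log_unigram = sum(unigram_log_prob.get(t, math.log(1e-8)) for t in ids)
    return (log_p - log_unigram) / len(ids)
```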
6.2.3 Experimental Results and Analysis

From the ROUGE scores shown in Table 6.1, the Sentence & N-gram Oracles perform the best since they can select words without any constraint to maximize the ROUGE score. In addition, it can be observed that there is a significant performance gap between the Sentence Oracle and the Syn-Constraint Oracle.

Fluency evaluations, including the results of the reference texts, are shown in Table 6.2. The Sentence Oracle and the reference texts provide a reasonable range of PPL and SLOR. The results show that the Syn-Constraint Oracle has a normal fluency score. In contrast, the Sentence & N-gram Oracles are far worse than the others, indicating that their summaries are unreadable. Table 6.4 provides exemplary sentences of these three different types of Oracles. It shows how the Syn-Constraint Oracle removes irrelevant information from the sentence without hurting its fluency. By contrast, the summaries generated by the Sentence & N-gram Oracles contain many grammatical errors, rendering them challenging to comprehend. Finally, the average length and compression ratio of summaries are shown in Table 6.3. The Sentence & N-gram Oracles greedily delete words to achieve higher ROUGE scores without any limits. The Syn-Constraint Oracle achieves a similar length and compression ratio to the references. It generates more concise summaries compared with the Sentence Oracle.

The experimental results of the compressibility study suggest that compressive text summarization has the potential to generate a concise summary with improved ROUGE scores, making it a desirable approach for summarization tasks where length limitations are a crucial consideration. However, syntactic guidance must be incorporated to achieve fluent summaries that are readable to humans.

6.3 Methodology

We present an unsupervised technique for compressive summarization using both word and sentence embeddings. It consists of two stages, namely sentence extraction and phrase extraction. We denote our method as Embedding-based Compressive Summarization with Two-stage Extraction (ECSTE). The phrase extraction stage can be employed with any sentence-level extraction method, enhancing the performance of a sentence extraction model. In this work, we utilize PacSum [298], a positional-aware centrality ranking model, for sentence extraction. In the following, we provide an introduction to the positional-aware centrality ranking model for sentence extraction, as well as our proposed phrase extraction stage.

6.3.1 Sentence extraction

Given a document D consisting of a set of sentences s_1, s_2, ..., s_N, the sentence embeddings of sentences s_i and s_j are denoted as se_i and se_j, respectively. The pairwise sentence similarity ssim_{ij} between s_i and s_j is obtained by the dot product of their sentence embeddings, \langle se_i, se_j \rangle. The centrality of sentence s_i is then determined by summing up its pairwise similarities with all the other sentences in the document, which can be expressed as:

Centrality(s_i) = \sum_{j=1}^{N} ssim_{ij}.

In many summarization datasets, there exists a position bias, where crucial information is frequently located in specific positions, such as the beginning or end of the document. To account for this bias, a position-aware centrality measure is proposed, which considers the position information of sentences.
Specifically, the position-aware centrality PC(s_i) for a sentence s_i is calculated as:

PC(s_i) = \lambda_1 \sum_{i > j} ssim_{ij} + \lambda_2 \sum_{i < j} ssim_{ij},

where \lambda_1 and \lambda_2 are hyper-parameters that control the weights of the similarity contributed by preceding and succeeding sentences, respectively. Next, sentences are ranked according to their position-aware centrality scores, and the top sentences are selected as the document's summary.

6.3.2 Phrase extraction

After the initial stage of sentence selection, the final summary is obtained by removing irrelevant phrases from the selected sentences, yielding a more concise output. We first define word importance based on word centrality. Subsequently, along with syntactic compression rules, the word importance identifies which phrases should be removed.

Word centrality is computed by examining the similarity of a word to the other words present in the document. Similar to the computation of sentence centrality, word centrality is determined by summing the similarities to other words. Suppose a document D contains a set of M words, represented as w_1, w_2, ..., w_M. The word-level centrality of a word w_i is calculated as:

Centrality(w_i) = \sum_{j=1}^{M} wsim_{ij},

where wsim_{ij} is the similarity between words w_i and w_j computed from their word embeddings.

Next, constituency parsing is applied to identify the removable phrases that do not affect the fluency of the sentences. As in Sec. 6.2, we employ the deletion rules proposed by [56]. Specifically, the following phrases are deletable: (1) parentheticals and fragments; (2) adjectives and adjectival phrases; (3) adverbs and adverbial phrases; (4) prepositional phrases; (5) appositive noun phrases; (6) relative clauses; and (7) conjoined noun phrases, verb phrases, and sentences. Fig. 6.1 shows the deletable phrases chosen by the rules for an exemplary sentence. Whether each deletable phrase is actually removed is then decided based on its importance. The score of a phrase p_j is computed by averaging the word centrality of its constituent words:

Score(p_j) = \frac{1}{|p_j|} \sum_{i=1}^{|p_j|} Centrality(w_i),

where |p_j| is the number of words in phrase p_j. Finally, the phrases with scores below a certain threshold are removed to produce the final summary.

Figure 6.1: Phrases that can be removed in the sentence: “the shot was fired at a dark coloured car by a white man .” PP = prepositional phrase, ADJP = adjective phrase, JJ = adjective

6.4 Experiments

6.4.1 Datasets, Experiment Setup, and Benchmarks

Datasets. We adopt three large-scale text summarization datasets, CNN/DM, XSUM, and PubMed. They are introduced in Sec. 6.2.

Experiment Setup. The hyper-parameters are fine-tuned on 1000 examples from the validation set. The score threshold for phrase removal is set as a hyper-parameter α times the average word centrality. The constituency parser we use is the Berkeley parser [108]. The number of extracted sentences for CNN/DM, XSUM, and PubMed is 3, 3, and 10, respectively. The sentence and word embeddings are obtained from pre-trained BERT without either supervised or unsupervised fine-tuning on the summarization datasets. As for the ROUGE evaluation, we notice that different public ROUGE implementations give different results. We use the ROUGE-1.5.5.pl∗ script to compute the ROUGE F1.

Benchmarks. In our evaluation, we compare ECSTE with various baseline methods and state-of-the-art unsupervised extractive techniques.
They are as follows: (1) Lead-k is a strong baseline method that chooses the first k sentences in the document as the summary. (2) TextRank [150] uses content overlap to rank sentences. (3) LexRank [62] is another early text summarization method similar to TextRank; however, it uses the cosine similarity of TF-IDF vectors to rank the sentences. (4) PacSum [298] purely selects sentences using positional-aware centrality. (5) STAS [277] applies attention matrices from Transformers to extract sentences. (6) PMI [172] ranks sentences by point-wise mutual information calculated by language models. (7) FAR [136] is a facet-aware centrality-based ranking model for unsupervised extractive text summarization. Note that there are a few unsupervised text summarization methods that work on sub-sentential units [238]. However, due to the lack of publicly available code, their results are not accessible on the datasets utilized in this work.

Method                      XSUM (R-1 / R-2 / R-L)    PubMed (R-1 / R-2 / R-L)    CNNDM (R-1 / R-2 / R-L)
Lead-k                      19.43 / 2.67 / 15.07      37.95 / 13.57 / 34.74       40.18 / 17.63 / 36.50
TextRank                    17.43 / 2.81 / 13.47      38.42 / 13.71 / 34.50       30.75 / 10.04 / 27.45
LexRank                     17.95 / 3.11 / 13.92      40.30 / 16.06 / 36.28       33.20 / 11.50 / 29.62
PMI                         19.07 / 3.22 / 12.47      - / - / -                   36.68 / 14.52 / 23.32
STAS                        - / - / -                 41.55 / 15.33 / 37.01       40.90 / 18.02 / 37.21
FAR                         19.33 / 2.60 / 15.62      41.98 / 15.66 / 37.58       40.83 / 17.85 / 36.91
Sent. Ext. Only (PacSum)    19.21 / 2.50 / 15.60      39.79 / 14.00 / 36.09       40.70 / 17.80 / 36.90
ECSTE (ours)                20.56 / 2.89 / 16.47      42.46 / 15.63 / 38.97       40.84 / 17.84 / 37.04
Table 6.5: ROUGE scores on the CNN/DM, XSUM, and PubMed datasets
∗ https://github.com/andersjo/pyrouge

Method      CNNDM (PPL / SLOR)    XSUM (PPL / SLOR)    PubMed (PPL / SLOR)
Reference   203.47 / 3.17         90.04 / 3.31         42.05 / 4.97
PacSum      80.59 / 3.63          60.41 / 3.38         39.52 / 4.86
ECSTE       79.52 / 3.64          122.42 / 2.82        43.85 / 4.93
Table 6.6: Fluency measurement of reference summaries, PacSum, and ECSTE. A lower PPL and a higher SLOR indicate better fluency.

6.4.2 Results and Analysis

Results. The ROUGE scores are shown in Table 6.5. On the XSUM and PubMed datasets, the ECSTE method leads to significant performance improvements when compared to sentence-level extractive approaches. This can be attributed to ECSTE's ability to eliminate irrelevant phrases present in the extracted sentences, thereby enhancing the overall effectiveness of the method. Furthermore, Table 6.6 shows the fluency scores of the reference summaries, PacSum, and ECSTE. We can see that, due to the constraints of the syntactic rules, the compressive summaries generated by ECSTE remain fluent.

Compressive Results Visualization. We provide visual examples of the summaries generated by ECSTE on the CNN/DailyMail and XSUM datasets in Fig. 6.2. For comparison purposes, we also show the sentence-level summaries selected by PacSum. As can be seen, ECSTE removes irrelevant phrases from the initially selected sentences, resulting in more concise and effective summaries.

Failure cases. The experimental results show that ECSTE does not yield an improvement in the ROUGE score on the CNN/DailyMail dataset. We have identified that ECSTE tends to remove detailed words that have a low degree of similarity to other words in a long document. In particular, in the example illustrated in Figure 6.3, the phrase "under 82 languages in 20 distinct scripts" contains detailed information, and words with a similar semantic meaning do not frequently occur in the document. As a result, this phrase has low word centrality, and ECSTE erroneously removes it.
To overcome this limitation, a more sophisticated unsupervised phrase extraction criterion needs to be proposed to address this issue. We leave this as an area of future research.

Figure 6.2: Exemplary summaries output by our model on the CNN/DailyMail and XSUM datasets. For illustration, the compressive summary shows the removed phrases using red strike-through.

Figure 6.3: Exemplary failure cases on the CNN/DailyMail dataset. For illustration, the compressive summary shows the erroneously removed phrases using blue strike-through.

6.5 Conclusion and Future Work

In this work, we investigate the effects of compressing selected sentences on text summarization. The experimental results show that compressive text summarization can produce concise summaries, making it a desirable approach in situations where length limitations are a crucial factor. However, to ensure the fluency and readability of the generated summaries, it is important to incorporate syntactic guidance. Then, we propose an unsupervised compressive summarization approach involving sentence and phrase extraction stages. Our experimental results demonstrate that this method leads to a significant improvement in ROUGE scores compared to traditional sentence-level text summarization.

In the future, we plan to make several extensions to our work. Firstly, we will conduct human evaluations for both key information containment and fluency measurement to obtain more comprehensive and reliable results. Secondly, more advanced phrase removal techniques will be investigated. For example, instead of only using word centrality as the criterion for phrase removal, other metrics, such as redundancy, can be studied. Lastly, we will extend our approach to the scenario of length-limited summarization, where a specified maximum length will constrain the summarization process.

Chapter 7
Sub-Structure Beam Search (SBS) for Structured Data Generation

7.1 Introduction

Large Language Models (LLMs) have made remarkable progress in recent years, particularly in the field of text generation. A variety of recent studies have delved into the capacity of LLMs to produce structured data, diverging from conventional natural language text generation. These studies have explored the generation of diverse structured data formats, including but not limited to product catalogs [20, 199, 200], tabular data [23, 109, 133, 273], and programming languages [6, 260, 289]. In contrast to plain text, the generation of structured data adheres to predefined output formats governed by specific syntax rules. For instance, in the medical domain, when extracting information from clinical notes, the raw input data is textual in the form of clinical notes. However, the underlying structure is inherently tabular, consisting of a patient's demographics and other medical diagnostic attributes.

Structured data generation can be processed using a single prompt. This is typically achieved by prompt engineering [5, 26] or fine-tuning an LLM [23, 109]. Instead of employing a multi-pass generation approach, where each piece of sub-structure-level data is processed individually, an LLM jointly generates the entire structured data set, including all sub-structures, within a single prompt. For instance, as depicted in Fig. 7.1 (a), products within a catalog typically contain different attributes (i.e., sub-structures), such as brand, color, and size. The generation of each product attribute can be viewed as a standalone sub-structure-level task.
Although it is possible to prompt the LLM to generate each attribute individually via multiple passes, it is more efficient to employ an LLM to generate all attributes as a whole, leading to a set of complete and detailed product attributes. Another instance is knowledge entity completion, where each entity can be considered a sub-structure. LLMs are prompted with an initial low-quality knowledge base to correct and complete all knowledge entities.

Despite the advancements made by LLMs, concerns persist regarding the quality of generation. LLMs are prone to producing inaccurate generations, often due to hallucinations or incorrect references [11, 99, 131, 233, 274]. This challenge is not limited to unstructured data generation but also extends to structured data generation, highlighting the necessity to explore methods for enhancing the quality of structured data generation using LLMs. Given the unique format of structured data, LLMs may exhibit confidence only in certain sub-structure generations. In such cases, it is preferable to retain the accurate sub-structure generations rather than discard the entire output. Hence, there is a need to discern, within the LLM's response, the generated sub-structures for which the LLM's accuracy is assured.

Figure 7.1: Examples of structured data generation using LLMs.

In this work [266], we introduce a novel decoding method tailored for LLM structured data generation called Sub-structure Beam Search (SUBS). In contrast to conventional decoding methods that focus solely on the token level, SUBS operates at the sub-structure level within the context of structured data generation. It assesses a score for each sub-structure and dynamically adjusts the prompt during generation to optimize the output of the LLM. Depending on the nature of the task, the score of a sub-structure can be computed using various methods. For instance, this calculation may rely on token conditional probabilities derived from the original LLM or utilize an external model, such as a sub-structure score predictor leveraging the LLM's internal hidden states. Our experiments, conducted on information extraction tasks such as product attribute extraction, reveal that SUBS yields notable performance enhancements compared to traditional text generation decoding methods.

The rest of this chapter is organized as follows. In Sec. 7.2, we overview the background of structured data generation with LLMs and introduce the SUBS method. Sec. 7.3 studies score calculation methods for generated sub-structure-level data and presents the experimental results. Finally, conclusions and future work are discussed in Sec. 7.4.

7.2 Method

7.2.1 Preliminary

In this work, we denote x as the initial conditioning context (initial prompt) for a text-generative LLM, and y as the entire sequence of generated structured data. t represents a token. Special tokens, such as “<SEP>” and “<END>”, can be defined to separate the attribute key from the attribute value and to delimit different attributes, respectively. An attribute s can then be represented by the following sequence of tokens:

s = t_{k,1}, t_{k,2}, ..., <SEP>, t_{v,1}, t_{v,2}, ..., <END>,

where t_{k,i} and t_{v,j} are the i-th token in the attribute key and the j-th token in the attribute value, respectively.

7.2.2 Sub-structure Score Calculation in Structured Data Generation

In the structured data generation process, LLMs may produce incorrect predictions in certain sub-structure-level data due to inaccurate inferences or hallucinations. As in the example of product attribute generation illustrated in Fig. 7.2, the sub-structure sequence “Material <SEP> Plastic <END>” is incorrectly generated while the remaining sub-structure sequences are correct (e.g., “Department <SEP> Women <END>”, “Style <SEP> Casual <END>”). Employing a sub-structure scorer becomes crucial to address such instances. This scorer evaluates each sub-structure sequence and assigns a score, enabling the identification of incorrect predictions and enhancing the overall reliability of the generated structured data.

7.2.3 Sub-Structure Beam Search Decoding

We propose Sub-structure Beam Search (SUBS). SUBS is suitable for the generation of structured data as it operates on the sub-structure level and considers the score for each sub-structure sequence.

Figure 7.2: In structured data generation, we tokenize the LLM output into sub-structure sequences and assign a score to each prediction based on the prescribed method [266].

Greedy search is one of the most common decoding methods in text generation. As illustrated in Equation (7.1), when generating a sequence of structured data y with n tokens, it selects the token with the maximum conditional probability at each time step i:

y = \prod_{i=1}^{n} \arg\max_{t_i} p(t_i \mid t_{<i}, x).   (7.1)

7.3.2 Experiments on Attribute Extraction

We validate SUBS-CP, i.e., SUBS with sub-structure scores computed from the token conditional probabilities of the LLM, on product attribute extraction, which is one of the information extraction tasks. In this task, the LLM is required to extract attribute/value pairs given a product title or a product description.

Datasets. In line with prior research [26, 257, 281] focusing on product attribute extraction, we utilize two benchmark datasets: OA-Mine [292] and AE-110K [275]. These publicly available datasets contain English products with annotated attribute/value pairs. OA-Mine contains products sourced from Amazon, while AE-110K comprises products obtained from the Sports & Entertainment category on AliExpress. An example from OA-Mine is presented in Fig. 7.4.

Figure 7.4: An example from the OA-Mine dataset.

Experimental Setup. Two open-source LLMs are used as the backbone models to evaluate the performance of SUBS-CP. They are Beluga-7B (Llama2-based) [160, 235] and Gemma-2B [228]. In our experiments, we utilize few-shot prompting due to the more consistent output format of LLMs compared to zero-shot prompting. The few-shot prompting template, following the format outlined in [26], is illustrated in Fig. 7.5. The template consists of four parts: role description, task description, demonstration, and task input. The role description provides an overview of the role the LLM is expected to perform. The task description elaborates further on the specific details and requirements of the task. The demonstration consists of one or more examples illustrating how the task should be performed. These examples are input-output pairs, illustrating the expected outcome for the given task. The task input is the actual input given to the LLM for task execution.

Figure 7.5: Few-shot prompt template for product attribute extraction.

Results & Discussions. We compare SUBS-CP against two widely used text generation decoding algorithms: greedy search and token-level beam search. The F1 score of the generated attributes serves as the evaluation metric. The shot size is set to 2 in few-shot prompting. The results depicted in Table 7.1 demonstrate that our proposed method, SUBS-CP, outperforms the other two token-level decoding methods across different LLMs without any extra training of the LLMs.
This performance improvement stems from its operation at the sub-structure level, where it evaluates the score 112 for each sub-structure during generation. Furthermore, it is evident that both beam search methods, tokenlevel beam search and SUBS-CP, surpass greedy search as they select the final generation from a pool of candidates. Table 7.1: F1 Score of decoding methods on attribute extraction. The best results are displayed in boldface Beam Size Gemma-2B Beluga-7B AE-110K OA-Mine AE-110K OA-Mine Greedy Search 53.04 50.52 67.35 61.97 Token-Level Beam Search 2 54.97 51.23 69.01 63.25 3 55.44 51.41 68.86 63.28 4 56.06 51.54 68.60 63.41 SUBS-CP (Ours) 2 57.47 51.65 69.32 64.32 3 58.75 50.54 69.91 64.78 4 58.95 49.33 69.69 64.28 7.3.3 Confidence Estimator Model confidence measures the probability of the model’s prediction being correct, reflecting the trustworthiness of the model’s output. A recent study [11] provides evidence suggesting that the hidden state of LLMs can indicate whether they are truthful in their generation. To explore this further, we develop a small external model, referred as the Confidence Estimator, to evaluate the confidence of an LLM on each generated sub-structure based on its hidden state. Subsequently, we utilize the score from the Confidence Estimator as the sub-structure score in SUBS. The whole process is depicted in Fig. 7.6. We denote this sub-structure scoring method as CE and SUBS using CE as SUBS-CE We introduce a Feed Forward Network classifier as the Confidence Estimator. The Confidence Estimator is trained as a binary classifier to predict whether if the generated sub-structure sequence is faithful. Specifically, the hidden states from an LLM are utilized to construct the representation of the sub-structure 113 sequence, which serves as the input for the Confidence Estimator. During inference, the soft label outputted by the Confidence Estimator is then used as the score for the sub-structure sequence generated by the LLM. Figure 7.6: Confidence Estimator: it uses the hidden states from the LLM to assess the score of generated sub-structure As hidden states from different layers of an LLM contain different perspectives of information, hidden states from various layers can be tested. Given a sub-structure sequence s consisting of p tokens, {t1, t2, . . . , tp}, we can retrieve the corresponding hidden state of layer l {h1l , h2l , . . . , hpl} from the LLM. To derive the sub-structure representation from these internal hidden states, we employ Extreme method following [234]. Extreme involves the concatenation of the hidden states of the first and last tokens in the sub-structure sequence, i.e., [h1l ; hpl] 7.3.4 Experiments on Model Confidence We validate our approach in the domain of structured product catalog data. We formulate a generation problem where the LLM is tasked with generating complete product attributes given low-quality product entity data, such as incorrect or missing attribute values. This scenario is similar to the example illustrated in Fig. 7.1 (b), but within the context of product catalogs. Experimentation Setup 114 1) Backbone Structured Data Generation LLM: We use a fine-tuned Product Catalog LLM based on the publicly available MPT 7B [229] as the backbone LLM. The fine-tuned Product Catalog LLM is able to generate all relevant product attributes in a single pass, given product entity data. 
2) Confidence Estimator Model Setup and Training Strategy: In this paper, we used a 3-layer Feed Forward Network classifier as the Confidence Estimator. The Confidence Estimator applies ReLU non-linear activation between hidden layers and Sigmoid non-linear activation after the output layer. The training data are Product Catalog data collected from online E-commerce platforms. It is trained using the binary cross-entropy loss function, aiming to predict the faithfulness of the generated sub-structure sequence. 3) Test data: We sampled a test set containing 1k English products with 10k product attributes. The LLM-generated attribute values are labeled by a group of in-house auditors who are trained in this domain. Experimental Results & Analysis Average precision serves as a metric for evaluating performance across various sub-structure confidence estimation and decoding methods. Furthermore, we assess recall at a specific precision since many real-world applications aim to guarantee that the generated sub-structure-level data is above a designated precision threshold while maintaining a high level of recall. SUBS can utilize either the CP or the CE methods to obtain the scores for sub-structure sequences. To assess the recall and precision of product attribute generation, we need to assign a score to each generated product attribute, indicating the probability of its generation. To ensure a fair evaluation of each generated sub-structure sequence, we employ the CP method to assess the score of SUBS’s generation, as CP is purely calculated from the original LLM. Specifically, regardless of whether SUBS utilizes scores from CP or CE during decoding process, we evaluate the probability of generation using CP. The comparison between SUBS and other decoding methods is presented in Table 7.2. In our experiments, the CE uses the attribute representation built on the hidden state of layer 16, as it demonstrates the best performance among other attribute representations. 115 Table 7.2: Average Precision and R@P90 of decoding methods, where the best and the second-best results are displayed in boldface and with underscore, respectively. The probabilities of generations are calculated by the CP method. Decoding Method Beam Size AP R@P90 Greedy Search - 89.1 36.6 Token-Level Beam Search 2 89.3 34.1 4 89.3 34.8 SUBS-CP 2 90.8 58.6 4 90.5 60.5 SUBS-CE 2 92.6 74.7 4 92.6 75.6 Firstly, experimental results show a significant performance gap between SUBS and other compared decoding methods. This improvement in performance by SUBS is attributed to its operation at the substructure level, evaluating the score for each sub-structure during generation. Even SUBS-CP, which utilizes token conditional probability as the source like greedy search and token-level beam search, notably outperforms them without additional training or models. Secondly, SUBS-CE demonstrates the best performance owing to the externally trained Confidence Estimator. Lastly, we observed that a larger beam size can enhance the performance of both token-level beam search and SUBS due to a larger search space in this dataset. 7.3.5 Decoding Speed Besides the performance, the speed of decoding methods is another important factor in LLM generation. We examine the decoding speed of SUBS. Fig. 7.7 presents the F1 score and decoding speed of greedy search, token-level beam search, and SUBS-CP. The beam sizes ranging from 2 to 4 of token-level beam search and SUBS-CP are included. 
As observed, there exists a trade-off between speed and performance in our proposed SUBS. Due to the additional step involved in calculating the score of the sub-structure, our method naturally requires more time. However, this extra computational effort contributes to achieving better performance. Thus, 116 Figure 7.7: F1 score and decoding speed of decoding methods on AE-110K dataset using Gemma-2B. depending on preferences and priorities, users can decide whether they are willing to utilize SUBS for improved performance at the expense of some speed. 7.4 Conclusion and Future Work In this work, we introduced a novel decoding method called SUBS for LLM structured data generation. It enhances the structured data generation quality by assessing the score for each generated sub-structurelevel data and iteratively refining prompts. We explored two different methods for calculating the scores of the generated sub-structure. The first one uses the conditional probabilities of tokens in sub-structures, and the other one trains a Confidence Estimator. Both methods show significant improvement in the task of product attribute generation using LLMs. As to future work direction, it’s promising to extend the application of SUBS to more types of structured data, such as tabular data and knowledge bases. Additionally, exploring a more advanced sub-structure scoring method to further boost the performance of SUBS is intriguing. 117 Chapter 8 Conclusion and Future Work 8.1 Summary of the Research In this thesis, we focus on the development of syntax-aware NLP techniques and their applications across various NLP tasks. We study the significance of syntax in NLP applications and propose novel methods that integrate syntax at the word-level, sentence-level, and document-level and structured-data-level tasks. To begin, we introduce low-computational and lightweight POS tagging for predicting the syntactic functions of words in sentences. For the word-level task, we incorporate syntax into word embedding learning. Specifically, by the context selected by dependency parsing, and enhancement from word-class mutual information, our proposed classification-specific dependency-based word embedding outperforms several state-of-the-art word embedding methods on text classification tasks. Moving to the sentence-level task, we address the challenge of sentence similarity evaluation by introducing the Syntax-aware Word Mover’s Distance (SynWMD). By integrating dependency parse tree techniques into both word flow assignment and word distance modeling in original WMD, SynWMD significantly improves the performance of sentence similarity evaluation. For the document-level task, we focus on compressive text summarization, where we leverage syntax to select sub-sentential units and condense them into concise summaries while eliminating irrelevant or 118 redundant information. Our research demonstrates that further compression of selected sentences under syntactic compression rules substantially enhances summarization model performance. Lastly, in the task with structured data, we introduce the Sub-structure Beam Search, a novel decoding method tailored for generating structured data which adheres to predefined syntax formats. This method operates not only at the token level but also at the sub-structure level, enabling the selection of the next sub-structure based on their respective scores, thereby enhancing the generation of structured data. 
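To make this recap concrete, the following is a minimal sketch of the sub-structure-level decoding loop summarized above. It is an illustration under stated assumptions rather than the implementation evaluated in Chapter 7: the two callables, generate_candidates and score_substructure, as well as the end-of-record marker and the additive score aggregation, are placeholders standing in for LLM-specific candidate generation and for a sub-structure scorer based on token conditional probabilities or a trained Confidence Estimator.

```python
# A minimal sketch of sub-structure-level beam search. generate_candidates()
# is assumed to ask the LLM for several candidate next sub-structures given
# the current (dynamically extended) prompt, and score_substructure() is
# assumed to return a score for one candidate. Both names, the end marker,
# and the additive cumulative score are illustrative assumptions.
from typing import Callable, List, Tuple

def substructure_beam_search(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],
    score_substructure: Callable[[str, str], float],
    beam_size: int = 2,
    max_substructures: int = 10,
    end_marker: str = "<END_OF_RECORD>",
) -> str:
    # Each beam is (generated text so far, cumulative sub-structure score).
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for _ in range(max_substructures):
        expanded: List[Tuple[str, float]] = []
        for generated, score in beams:
            if generated.endswith(end_marker):
                expanded.append((generated, score))  # finished beams are kept as-is
                continue
            # The prompt is extended with the sub-structures accepted so far.
            for cand in generate_candidates(prompt + generated, beam_size):
                expanded.append(
                    (generated + cand,
                     score + score_substructure(prompt + generated, cand))
                )
        if not expanded:  # no candidates produced; keep the previous beams
            break
        # Keep only the best `beam_size` partial generations.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(g.endswith(end_marker) for g, _ in beams):
            break
    return beams[0][0]
```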
8.2 Future Research Directions Improving language models’ ability to comprehend and generate structured data that adheres to specific format syntax rules, as well as advancing the development of interpretable language models, are crucial for the ongoing advancement of NLP. This section explores two future research directions aimed at boosting the development of language models and their applications. 8.2.1 Mathematical Language Model There exists a great number of structured data in mathematical problems including formulas, equations, tables, and operation trees. With the significant advancements LLMs have made in NLP applications, exploring their potential for mathematical problem-solving becomes a compelling direction. Understanding how LLMs handle these structured mathematical data is crucial for the development of math LLMs. One promising direction is constructing math datasets with structured data, allowing LLMs to undergo continued pre-training or fine-tuning on these resources. For instance, Math23k [259] contains math problems annotated with structured equations, while datasets like AllArith [201]) and Dolphin1878 [217] provide graph or tree structures to depict relationships between variables mentioned in math problems. Another working direction worth exploring is enhancing LLMs’ capacity to comprehend information within structured mathematical data, such as discerning the relationships between numbers in a table. This 119 could be achieved by employing prompt engineering techniques, such as leveraging the chain of thought methodology, to guide LLMs in effectively interpreting and reasoning about mathematical information presented in various formats 8.2.2 Interpretable Language Model Although deep-learning-based LLMs are dominating the NLP field, they are inherently black-box methods without mathematical transparency. Its interpretability is of concern. Efforts have been made to explain the black-box LMs. As mentioned in 2.5.6.3, empirical studies are conducted to understand what PLMs have learned through experimental design. However, the progress in this direction may offer insights but not a satisfactory and clean answer. Providing theoretical explanations or establishing explainable LMs is still a challenging and open issue. A direction to interpretability is to design an interpretable learning model from scratch. For example, we may incorporate Knowledge Graphs(KGs) with LMs. KG is known to be capable of improving the interpretability and transparency of the system in many reasoning tasks such as information retrieval [58] and recommendation systems [283]. For example, reasoning paths and data sources can be provided with predictions when KGs are incorporated for reasoning. It is challenging for LMs to do so. It is critical to develop an interpretable LM to avoid its hallucination in natural language generation [95]. 120 Bibliography [1] Abubakar Abid, Maheen Farooqi, and James Zou. “Persistent anti-muslim bias in large language models”. In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 2021, pp. 298–306. [2] Mijit Ablimit, Graham Neubig, Masato Mimura, Shinsuke Mori, Tatsuya Kawahara, and Askar Hamdulla. “Uyghur morpheme-based language models and ASR”. In: IEEE 10th INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS. IEEE. 2010, pp. 581–584. [3] Mohamed Afify, Olivier Siohan, and Ruhi Sarikaya. “Gaussian mixture language models for speech recognition”. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07. 
Vol. 4. IEEE. 2007, pp. IV–29. [4] Himanshu Agarwal and Anirudh Mani. “Part of speech tagging and chunking with conditional random fields”. In: the Proceedings of NWAI workshop. 2006. [5] Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag. “Large language models are few-shot clinical information extractors”. In: arXiv preprint arXiv:2205.12689 (2022). [6] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. “Unified Pre-training for Program Understanding and Generation”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021, pp. 2655–2668. [7] Alan Akbik, Duncan Blythe, and Roland Vollgraf. “Contextual string embeddings for sequence labeling”. In: Proceedings of the 27th international conference on computational linguistics. 2018, pp. 1638–1649. [8] Ramiz M Aliguliyev. “A new sentence similarity measure and sentence based extractive technique for automatic text summarization”. In: Expert Systems with Applications 36.4 (2009), pp. 7764–7772. [9] Ebru Arisoy, Tara N Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. “Deep neural network language models”. In: Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT. 2012, pp. 20–28. 121 [10] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. “A simple but tough-to-beat baseline for sentence embeddings”. In: International conference on learning representations. 2017. [11] Amos Azaria and Tom Mitchell. “The internal state of an llm knows when its lying”. In: arXiv preprint arXiv:2304.13734 (2023). [12] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014). [13] Lalit R Bahl, Frederick Jelinek, and Robert L Mercer. “A maximum likelihood approach to continuous speech recognition”. In: IEEE transactions on pattern analysis and machine intelligence 2 (1983), pp. 179–190. [14] Tao Lei Hrishikesh Joshi Regina Barzilay, Tommi Jaakkola, Katerina Tymoshenko, and Alessandro Moschitti Lluıs Marquez. “Semi-supervised Question Retrieval with Gated Convolutions”. In: Proceedings of NAACL-HLT. 2016, pp. 1279–1289. [15] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 2021, pp. 610–623. [16] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. “A neural probabilistic language model”. In: Advances in neural information processing systems 13 (2000). [17] Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. “Jointly learning to extract and compress”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011, pp. 481–490. [18] Adam Berger, Stephen A Della Pietra, and Vincent J Della Pietra. “A maximum entropy approach to natural language processing”. In: Computational linguistics 22.1 (1996), pp. 39–71. [19] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. “Language (Technology) is Power: A Critical Survey of “Bias” in NLP”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, pp. 5454–5476. [20] Ansel Blume, Nasser Zalmout, Heng Ji, and Xian Li. “Generative Models for Product Attribute Extraction”. 
In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2023, pp. 575–585. [21] Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. “Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2642–2652. [22] Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. “Morphosyntactic Tagging with a Meta-BiLSTM Model over Context Sensitive Token Encodings”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 2642–2652. doi: 10.18653/v1/P18-1246. 122 [23] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. “Language models are realistic tabular data generators”. In: arXiv preprint arXiv:2210.06280 (2022). [24] Kaj Bostrom and Greg Durrett. “Byte pair encoding is suboptimal for language model pretraining”. In: arXiv preprint arXiv:2004.03720 (2020). [25] Eric Brill. “A simple rule-based part of speech tagger”. In: Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992. 1992. [26] Alexander Brinkmann, Roee Shraga, and Christian Bizer. “Product Attribute Value Extraction using Large Language Models”. In: arXiv preprint arXiv:2310.12537 (2023). [27] Peter F Brown, John Cocke, Stephen A Della Pietra, Vincent J Della Pietra, Frederick Jelinek, John Lafferty, Robert L Mercer, and Paul S Roossin. “A statistical approach to machine translation”. In: Computational linguistics 16.2 (1990), pp. 79–85. [28] Peter F Brown, Vincent J Della Pietra, Peter V Desouza, Jennifer C Lai, and Robert L Mercer. “Class-based n-gram models of natural language”. In: Computational linguistics 18.4 (1992), pp. 467–480. [29] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901. [30] Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, and Magnus Sahlgren. “Semantic re-tuning with contrastive tension”. In: International Conference on Learning Representations. 2020. [31] Ciprian Chelba and Frederick Jelinek. “Exploiting syntactic structure for language modeling”. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1. 1998, pp. 225–231. [32] Ciprian Chelba and Frederick Jelinek. “Structured language modeling”. In: Computer Speech & Language 14.4 (2000), pp. 283–332. [33] Danqi Chen and Christopher D Manning. “A fast and accurate dependency parser using neural networks”. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 740–750. [34] Stanley F Chen and Joshua Goodman. “An empirical study of smoothing techniques for language modeling”. In: Computer Speech & Language 13.4 (1999), pp. 359–394. [35] Stanley F Chen and Ronald Rosenfeld. “Efficient sampling and feature selection in whole sentence maximum entropy language models”. In: 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. 
Abstract
Syntax governs the structure of textual data and plays a crucial role in understanding and generating text. For example, the syntax of a natural language sentence determines the relationships among its words, which is essential for grasping the sentence's overall meaning. In programming languages, syntax defines the valid combinations of symbols, ensuring that computers can interpret and execute code accurately. This thesis has two primary objectives: 1) develop efficient methods for constructing syntactic structures, and 2) investigate the significance of syntax and integrate syntax-aware techniques into various Natural Language Processing (NLP) applications.
We first build an efficient part-of-speech (POS) tagger. A POS tag denotes a word's syntactic function in a sentence, and this form of syntactic information is crucial for constructing the syntactic structure of a sentence. In the remainder of the thesis, we explore syntax-aware techniques in NLP applications at four levels: word-level, sentence-level, document-level, and structured-data-level tasks.
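As a brief illustration of this kind of syntactic information, the sketch below uses the off-the-shelf spaCy toolkit to print POS tags and dependency relations for a sentence. It is an illustrative assumption only; it is not the efficient tagger developed in this thesis, and the example sentence and model name are placeholders.

```python
# Illustrative only: spaCy's pretrained English pipeline stands in for a POS
# tagger and dependency parser; it is not the tagger proposed in this thesis.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
doc = nlp("Syntax governs the relationships between words.")

for token in doc:
    # word, universal POS tag, dependency relation, and syntactic head
    print(f"{token.text:15s} {token.pos_:6s} {token.dep_:10s} head={token.head.text}")
```

These POS tags and head-dependent relations are the basic ingredients from which the sentence-level syntactic structures used throughout the thesis are built.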
On the word-level task, we apply syntax-aware techniques to word embedding learning. Word embedding methods learn word representations from context, following the distributional hypothesis. Most previous methods select sequential context with sliding windows, and the resulting embeddings are general-purpose. By selecting context with dependency parsing and enhancing it with word-class mutual information, our proposed classification-specific dependency-based word embedding outperforms several state-of-the-art word embedding methods on text classification tasks.

On sentence-level tasks, we apply syntax-aware techniques to sentence similarity evaluation, which measures the semantic similarity between sentences and is important in information retrieval, text summarization, and question answering. We propose a syntax-aware Word Mover's Distance (SynWMD) that addresses limitations of the original WMD by incorporating the dependency parse tree in both word flow assignment and word distance modeling, thereby improving sentence similarity evaluation.

On document-level tasks, we apply syntax-aware techniques to text summarization. We conduct a comprehensive study of the impact of further compressing the sentences selected for a summary. The results show that, under syntactic compression rules, further compression of selected sentences can significantly enhance the performance of summarization models. In addition, we propose an unsupervised compressive method that leverages word and sentence embeddings to select phrases and sentences as the final summary; experiments show that it outperforms traditional sentence-level extractive summarization.

Lastly, on the structured-data-level task, we present a novel decoding method, Sub-structure Beam Search (SUBS), for generating structured data. Unlike conventional natural language, structured textual data, such as knowledge entities and tabular data, follows specific syntax formats. By incorporating sub-structure information from the structured data into the text generation decoding process, our method significantly enhances the quality of structured data generated by large language models (LLMs).
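To make the general idea of structure-aware decoding concrete, the toy sketch below biases a beam search toward outputs that respect a simple "key=value;" record format: continuations that would break the format are pruned before scoring. The bigram scores, vocabulary, and grammar are hand-crafted assumptions for illustration; this is not the SUBS method itself, nor an LLM decoding implementation from the thesis.

```python
# Toy illustration of structure-aware decoding: beam-search continuations that
# would break a simple "key=value;" record format are pruned before scoring.
# The bigram scores below are a hand-crafted stand-in for language-model
# log-probabilities; this is NOT the SUBS method proposed in the thesis.

# Hypothetical bigram log-probabilities: previous token -> {next token: logp}.
BIGRAM_LOGP = {
    "<bos>": {"name": -0.2, "year": -1.8},
    "name": {"=": -0.1},
    "year": {"=": -0.1},
    "=": {"Alice": -0.7, "2024": -0.8},
    "Alice": {";": -0.3},
    "2024": {";": -0.3},
    ";": {"year": -0.9, "name": -1.2, "<eos>": -0.6},
}

def is_valid_prefix(tokens):
    """Check that the tokens so far follow the pattern (key '=' value ';')*."""
    state = "key"
    for t in tokens:
        if t == "<eos>":
            return state == "key"                 # may only stop at a record boundary
        if state == "key" and t in ("name", "year"):
            state = "eq"
        elif state == "eq" and t == "=":
            state = "value"
        elif state == "value" and t in ("Alice", "2024"):
            state = "sep"
        elif state == "sep" and t == ";":
            state = "key"
        else:
            return False
    return True

def beam_search(beam_size=2, max_len=8):
    beams = [([], 0.0)]                           # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":  # finished hypotheses are kept as-is
                candidates.append((tokens, score))
                continue
            prev = tokens[-1] if tokens else "<bos>"
            for nxt, logp in BIGRAM_LOGP.get(prev, {}).items():
                new_tokens = tokens + [nxt]
                if is_valid_prefix(new_tokens):   # sub-structure constraint
                    candidates.append((new_tokens, score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for tokens, score in beam_search():
    print(" ".join(tokens), round(score, 2))
```

The thesis applies this general principle with actual LLM token probabilities and with the sub-structure formats of knowledge entities and tabular data rather than the toy grammar used here.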
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation
Annotating FrameNet via structure-conditioned language generation
Human motion data analysis and compression using graph based techniques
Graph-based models and transforms for signal/data processing with applications to video coding
Aggregating symbols for language models
Green image generation and label transfer techniques
Generating psycholinguistic norms and applications
Speech recognition error modeling for robust speech processing and natural language understanding applications
Labeling cost reduction techniques for deep learning: methodologies and applications
Modeling expert assessment of empathy through multimodal signal cues
Explainable and lightweight techniques for blind visual quality assessment and saliency detection
A green learning approach to image forensics: methodology, applications, and performance evaluation
Advanced techniques for green image coding via hierarchical vector quantization
Fairness in natural language generation
Multimodal reasoning of visual information and natural language
Emphasizing the importance of data and evaluation in the era of large language models
Green learning for 3D point cloud data processing
Towards understanding language in perception and embodiment
Modeling, learning, and leveraging similarity
Building generalizable language models for code processing
Asset Metadata
Creator
Wei, Chengwei (author)
Core Title
Syntax-aware natural language processing techniques and their applications
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2024-05
Publication Date
05/17/2024
Defense Date
05/14/2024
Publisher
Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tag
language model, natural language processing, OAI-PMH Harvest, syntactic structure
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Ortega, Antonio (committee member), Swayamdipta, Swabha (committee member)
Creator Email
chengwei@usc.edu, chengwei7272@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113939988
Unique identifier
UC113939988
Identifier
etd-WeiChengwe-12945.pdf (filename)
Legacy Identifier
etd-WeiChengwe-12945
Document Type
Dissertation
Rights
Wei, Chengwei
Internet Media Type
application/pdf
Type
texts
Source
20240517-usctheses-batch-1154 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu