ANNOTATING FRAMENET VIA STRUCTURE-CONDITIONED LANGUAGE GENERATION

by Xinyue Cui

A Thesis Presented to the
FACULTY OF THE USC VITERBI SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE (COMPUTER SCIENCE)

May 2024

Copyright 2024 Xinyue Cui

Acknowledgements

First and foremost, I am deeply grateful to my advisor, Dr. Swabha Swayamdipta, whose guidance and support have been invaluable throughout this research. I also appreciate the insightful feedback from my thesis committee members, Dr. Swabha Swayamdipta, Dr. Robin Jia, and Dr. Jieyu Zhao. My heartfelt thanks go to my family (my parents, Chunli Cui and Lina Liu, and my grandparents, Junming Liu and Zhihong Wang) for their unwavering support and unconditional love. To my boyfriend, Ziyu He, thank you for standing by me through the highs and lows. This thesis also benefited from the discussions and input of many of my peers and colleagues, whose contributions were immensely helpful. To everyone involved: thank you for making this work possible.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: FrameNet and Extensions
2.1 Sister LU Replacement
Chapter 3: Generating FrameNet Annotations via Frame-Semantic Conditioning
3.1 Selecting Candidate FEs for Generation
3.2 Generating Semantically Consistent Spans
3.3 Filtering Inconsistent Generations
3.4 Intrinsic Evaluation of Generations
3.4.1 Main Results
3.4.2 Reference-based Evaluation
Chapter 4: Augmenting Data for Frame-SRL
4.1 Experiments on Diminishing Returns
4.2 Experiments on Verb LU Instances
Chapter 5: Conclusion and Discussion
Bibliography

List of Tables

2.1 Number of instances and average number of all, core, and candidate FE spans per sentence, categorized by POS tags of LUs in FrameNet. C. FEs represents Core FEs and Cd. FEs represents Candidate FEs.
2.2 Examples of non-verb LUs where semantic consistency is preserved after sister LU replacement. The original LU is in teal and the replacement LU is in orange and parentheses.
3.1 Example sentences after LU replacement with FEs of type Agent, Self_mover, or their descendants, which are compatible with the new replacement LU. The ancestors of FE types are reported in parentheses. The FEs are shown in teal and the replacement LUs are shown in orange.
3.2 Template for finetuning T5 models on an example sentence.
3.3 Example prompts for GPT-4 models. Texts in green only appear in FE-Conditioning and Frame+FE-Conditioning models. Texts in orange only appear in Frame+FE-Conditioning models.
3.4 Training set size with and without lexicographic data, development set size, and test set size in FrameNet 1.7.
3.5 Example generations of GPT-4 | FE, our best model according to human acceptance. The two marks in human evaluation represent whether the generations satisfy the two criteria individually: (1) sentence-level semantic coherence and (2) preservation of all FE types. A sentence is deemed acceptable only when it satisfies both criteria.
3.6 Perplexity, FE fidelity and human acceptability of T5 and GPT-4 generations conditioned on different degrees of semantic information. Number of instances after filtering are in parentheses. Best results are in boldface.
3.7 Log BARTScore, ROUGE scores and perplexity of generations on the FrameNet test set without LU replacement.
4.1 F1 score of all LUs and augmented LUs under the unaugmented setting and augmented settings with generations from T5 | FE and GPT-4 | Frame+FE, averaged across 3 trials.
4.2 F1 score of all verb LUs and augmented LUs in augmentation experiments using different percentages of augmentations generated by T5 | FE with and without filtering, compared to baseline results without data augmentation.
4.3 F1 score of all verb LUs and augmented LUs in augmentation experiments using different percentages of augmentations generated by T5 | FE with and without filtering, compared to baseline results without data augmentation.

List of Figures

1.1 Our framework to generate frame-semantically annotated data. Following Pancholy, Petruck, and Swayamdipta [13], we replace a sister LU with the target LU in an annotated sentence (0; §2.1). We select FEs appropriate for generating a new structure-annotated sentence (1; §3.1), and execute generation via fine-tuning T5 or prompting GPT-4 (2; §3.2). Finally, we filter out sentences that fail to preserve LU-FE relationships under FrameNet (3; §3.3).
4.1 Learning curves for our frame-SRL model and Lin, Sun, and Zhang [10]'s end-to-end parser show diminishing returns on adding more human-annotated training data. The triangle marker denotes the performance of Lin, Sun, and Zhang [10]'s parser on SRL with gold frame and LU, as reported in their study.
Abstract

Despite the mounting evidence for the generative capabilities of language models in understanding and generating natural language, their effectiveness at explicit manipulation and generation of linguistic structures remains understudied. In this paper, we investigate the task of generating new sentences preserving a given semantic structure, following the FrameNet formalism. We propose a framework to produce novel frame-semantically annotated sentences following an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Nevertheless, we discover that generated frame-semantic structured data is ineffective for training data augmentation for frame-semantic role labeling. Our study concludes that while generating high-quality, semantically rich data might be within reach, its downstream utility remains to be seen, highlighting the outstanding challenges of automating linguistic annotation tasks.

Chapter 1: Introduction

Large language models (LLMs) have revolutionized generative AI by demonstrating unprecedented capabilities in generating natural language. These successes demonstrate language understanding capabilities, raising the question of their utility for tasks involving explicit manipulation of linguistic structure. Not only does answering this question help us understand the depth of LLMs' linguistic capabilities, but it also serves to enrich existing annotated sources of linguistic structure. In this work, we investigate the abilities of LLMs to generate annotations for one such resource of linguistic structure, FrameNet [18, 17]: a lexical resource grounded in the theory of frame semantics [3]. We propose an approach for language generation conditioned on frame-semantic structure, such that the generation is consistent with the structure, is acceptable to humans, and is useful for a downstream task, namely frame-semantic role labeling [4]. Previous work has explored semantically controlled generation with PropBank [16]; we instead target FrameNet, which is richer in semantic relationships and thus allows for a deeper evaluation of language models' semantic understanding.

Our framework for generating frame-semantic data leverages both the FrameNet hierarchy and LLMs' generative capabilities to transfer annotations from existing sentences to new examples. Specifically, we follow a frame-structure-conditioned language generation framework, focusing on specific spans in the sentence such that the resulting sentence follows the given frame structure and is also acceptable to humans. Overall, we follow an overgenerate-and-filter pipeline to ensure semantic consistency of the resulting annotations. Our framework is outlined in Figure 1.1.

[Figure 1.1 diagram: a four-stage pipeline. 0. Replace Sister LU: in "Growing up, boys are disciplined for breaking the rules." (LU discipline.v, frame REWARDS_AND_PUNISHMENTS, FEs Time, Evaluee, Reason), the sister LU discipline.v is replaced with the target LU reward.v (other sister LUs include punish.v, penalty.n, disciplinary.a). 1. Select FE spans for Replacement: "Growing up, <MASK> are rewarded <MASK>." 2. Structure-Conditioned Generation via T5 or GPT-4, e.g., "Growing up, children are rewarded often.", "Growing up, girls are rewarded for good behavior.", "Growing up, boys are rewarded for breaking the rules." 3. Filter Generations w/ Inconsistent FEs.]

Figure 1.1: Our framework to generate frame-semantically annotated data. Following Pancholy, Petruck, and Swayamdipta [13], we replace a sister LU with the target LU in an annotated sentence (0; §2.1).
We select FEs appropriate for generating a new structure-annotated sentence (1; §3.1), and execute generation via fine-tuning T5 or prompting GPT-4 (2; §3.2). Finally, we filter out sentences that fail to preserve LU-FE relationships under FrameNet (3; §3.3).

Our intrinsic evaluation, via both human judgment and automated metrics, shows that the generated sentences preserve the intended frame-semantic structure better than existing approaches [13]. As an extrinsic evaluation, we use our generations to augment the training data for frame-semantic role labeling: identifying and classifying spans in the sentence corresponding to FrameNet frames. However, this effort does not yield improvements, echoing observations from other studies that have reported challenges in leveraging LLMs for semantic parsing tasks, such as constituency parsing [1], dependency parsing [8], and abstract meaning representation parsing [2]. These findings prompt further investigation into the application of LLMs in semantic parsing and the nuances of enhancing model performance in complex NLP tasks.

Chapter 2: FrameNet and Extensions

Frame semantics theory [3] posits that understanding a word requires access to a semantic frame: a conceptual structure that represents situations, objects, or actions, providing context to the meaning of words or phrases. Frame elements (FEs) are the roles involved in a frame, each describing a certain aspect of the frame. A lexical unit (LU) pairs a token (specifically, a word lemma and its part of speech) with the frame it evokes. As illustrated in Figure 1.1, the token "disciplined" evokes the LU discipline.v, which is associated with the frame REWARDS_AND_PUNISHMENTS, with FEs including Time, Evaluee, and Reason.

Grounded in frame semantics theory, FrameNet [18] is a lexical database featuring sentences that are annotated by linguistic experts according to frame semantics. Within FrameNet, the majority of sentences are annotated with a focus on a specific LU within each sentence; these annotations are referred to as lexicographic data, and Fig. 1.1 shows such an instance. A subset of FrameNet's annotations consider all LUs within a sentence; these are called full-text data. Fig. 1.1, for example, does not consider other LUs such as grow.v or break.v. FrameNet defines 1,224 frames, covering 13,640 lexical units. The FrameNet hierarchy also links FEs using 10,725 relations. However, of the 13,640 identified LUs, only 62% have associated annotations. Our approach seeks to automatically generate annotated examples for the remaining 38% of the LUs, increasing coverage in FrameNet without laborious manual annotation.

2.1 Sister LU Replacement

Pancholy, Petruck, and Swayamdipta [13] propose a solution to FrameNet's coverage problem using an intuitive approach: since LUs within the same frame tend to share similar annotation structures, they substitute one LU (the target LU) with another (a sister LU) to yield a new sentence. This replacement approach only considers LUs with the same POS tag, to preserve the semantics of the original sentence; for instance, in Fig. 1.1, we replace the sister LU discipline.v with the target LU reward.v. However, due to the nuanced semantic differences between the two LUs, the specific content of the FE spans in the original sentence may no longer be consistent with the target LU in the new sentence. Indeed, Pancholy, Petruck, and Swayamdipta [13] report such semantic mismatches as their primary weakness. To overcome this very weakness, our work proposes leveraging language models to generate FE spans that better align with the target LU, as described subsequently.
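To make the replacement step concrete, the following minimal sketch shows how sister LUs could be enumerated and substituted using NLTK's FrameNet 1.7 corpus reader. This is an illustrative reconstruction rather than the exact implementation of Pancholy, Petruck, and Swayamdipta [13]: the helper names are assumptions, and the whole-word substitution deliberately sidesteps inflection handling ("disciplined" vs. the lemma "discipline") by operating on surface forms.

```python
# A minimal sketch of sister LU replacement over FrameNet 1.7, using NLTK's
# corpus reader (run nltk.download("framenet_v17") once). A full pipeline
# would additionally need to inflect the target lemma to match the original.
import re

from nltk.corpus import framenet as fn

def sister_lus(frame_name: str, target_lu: str) -> list[str]:
    """List LUs of the frame that share the target LU's POS tag."""
    frame = fn.frame(frame_name)
    pos = target_lu.rsplit(".", 1)[1]  # "reward.v" -> "v"
    return [name for name in frame.lexUnit
            if name != target_lu and name.rsplit(".", 1)[1] == pos]

def replace_lu(sentence: str, sister_form: str, target_form: str) -> str:
    """Swap the sister LU's surface form for the target LU's (whole word only)."""
    return re.sub(rf"\b{re.escape(sister_form)}\b", target_form, sentence)

if __name__ == "__main__":
    print(sister_lus("Rewards_and_punishments", "reward.v"))
    # Transfer the annotation of discipline.v to the unannotated target reward.v:
    print(replace_lu("Growing up, boys are disciplined for breaking the rules.",
                     "disciplined", "rewarded"))
```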
For the rest of this work, we focus solely on verb LUs, for which initial experiments showed the inconsistency problem to be the most severe. Table 2.1 presents a breakdown of the FrameNet corpus categorized by the POS tags of the LUs. Specifically, we report the number of instances and the average count of candidate FEs (the FEs to generate; more details in §3.1) per sentence, corresponding to LUs of each POS category. The two predominant categories are verb (v) LUs and noun (n) LUs, with verb LUs exhibiting a higher average number of candidate FE spans per sentence than noun LUs. Moreover, Table 2.2 shows several examples of non-verb LU replacement, where the resulting sentences mostly preserve semantic consistency. Given the extensive number of annotated verb LUs available for LU replacement, and their candidate FEs per sentence available for masking and subsequent structure-conditioned generation, our generation methodology is primarily applied to verb LUs.

LU POS   # Inst.   # FEs   # C. FEs   # Cd. FEs
v        82710     2.406   1.945      1.354
n        77869     1.171   0.675      0.564
a        33904     1.467   1.211      1.025
prep     2996      2.212   2.013      1.946
adv      2070      1.851   1.717      1.655
scon     758       1.906   1.883      1.883
num      350       1.086   0.929      0.549
art      267       1.547   1.543      1.408
idio     105       2.162   1.933      1.486
c        69        1.957   0.841      0.826

Table 2.1: Number of instances and average number of all, core, and candidate FE spans per sentence, categorized by POS tags of LUs in FrameNet. C. FEs represents Core FEs and Cd. FEs represents Candidate FEs.

Frame | LU | Sentence
Leadership | king.n (rector.n) | No prior Scottish king (rector) claimed his minority ended at this age.
Sounds | tinkle.n (yap.n) | Racing down the corridor, he heard the tinkle (yap) of metal hitting the floor.
Body_part | claw.n (back.n) | A cat scratched its claws (back) against the tree.
Disgraceful_situation | shameful.a (disgraceful.a) | This party announced his shameful (disgraceful) embarrassments to the whole world.
Frequency | always.adv (rarely.adv) | The temple is always (rarely) crowded with worshippers.
Concessive | despite.prep (in spite of.prep) | Despite (In spite of) his ambition, Gass' success was short-lived.
Conditional_Occurrence | supposing.scon (what if.scon) | So, supposing (what if) we did get a search warrant, what would we find?

Table 2.2: Examples of non-verb LUs where semantic consistency is preserved after sister LU replacement. The original LU is in teal and the replacement LU is in orange and parentheses.

Chapter 3: Generating FrameNet Annotations via Frame-Semantic Conditioning

We propose an approach to automate the expansion of FrameNet annotations by generating annotated data with language models. Given sister-LU-replaced annotations (§2.1; Pancholy, Petruck, and Swayamdipta [13]), we select FE spans which are likely to be semantically inconsistent (§3.1), generate new sentences with replacement spans by conditioning on frame-semantic structure information (§3.2), and finally filter inconsistent generations (§3.3).
3.1 Selecting Candidate FEs for Generation

We identify the FEs which often result in semantic inconsistencies, in order to replace them. Our selection of the ideal candidate spans for replacement takes into account the FE type, its ancestry under FrameNet, and the span's syntactic phrase type, as detailed below:

1. FE Type Criterion: The FE span to be generated must belong to a core FE type, i.e., the essential FEs that are necessary to fully understand the meaning of a frame.
2. Ancestor Criterion: The FE must not possess Agent or Self_mover ancestors.
3. Phrase Type Criterion: The FE's phrase type should be a prepositional phrase.

These criteria stem from preliminary experiments, in which we conducted a manual analysis of the compatibility of FE spans with replacement LUs on 50 example generations. As demonstrated by the sentence in Figure 1.1, the FE Type Criterion effectively eliminates non-core FEs that do not need to be masked, i.e., "Growing up" of FE type Time. Likewise, the Phrase Type Criterion identifies the candidate FE "for breaking the rules", which is a prepositional phrase. Moreover, we find that FEs of Agent or Self_mover type describe a human subject, which is typically independent of the LU evoked in the sentence. Since FE types within the same hierarchy tree share similar properties, we exclude FEs of Agent and Self_mover types, as well as any FEs having ancestors of these types, from our masking process, as illustrated in Table 3.1.

Sentence After Replacement | FE Type (Ancestor)
She was bending over a basket of freshly picked flowers, organizing them to her satisfaction. | Agent (Agent)
The woman got to her feet, marched indoors, was again hurled out. | Self_mover (Self_mover)
While some presumed her husband was dead, Sunnie refused to give up hope. | Cognizer (Agent)

Table 3.1: Example sentences after LU replacement with FEs of type Agent, Self_mover, or their descendants, which are compatible with the new replacement LU. The ancestors of the FE types are reported in parentheses. The FEs are shown in teal and the replacement LUs are shown in orange.

Qualitative analyses revealed that it suffices to meet criterion (1) while satisfying either (2) or (3). For instance, in Fig. 1.1, under REWARDS_AND_PUNISHMENTS, only the FEs Evaluee and Reason are core (and satisfy (2)), while Time is not; thus we only select the last two FE spans for generation.
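The selection rule can be rendered schematically as below, assuming the FE's coreness, phrase type, and FrameNet ancestors have already been looked up; the FESpan container and its field names are assumptions of this sketch, not the implementation used in this work.

```python
# A schematic rendering of the candidate-FE selection rule: criterion (1) must
# hold, together with either criterion (2) or criterion (3).
from dataclasses import dataclass, field

EXCLUDED = {"Agent", "Self_mover"}  # criterion (2): excluded FE ancestry

@dataclass
class FESpan:
    name: str              # FE type, e.g. "Reason"
    is_core: bool          # criterion (1): core FE of the frame
    phrase_type: str       # criterion (3): syntactic phrase type, e.g. "PP"
    ancestors: set[str] = field(default_factory=set)

def is_candidate(fe: FESpan) -> bool:
    no_agentive_ancestry = not (({fe.name} | fe.ancestors) & EXCLUDED)
    return fe.is_core and (no_agentive_ancestry or fe.phrase_type == "PP")

# The Figure 1.1 example under REWARDS_AND_PUNISHMENTS:
spans = [FESpan("Time", is_core=False, phrase_type="VPing"),  # "Growing up"
         FESpan("Evaluee", is_core=True, phrase_type="NP"),   # "boys"
         FESpan("Reason", is_core=True, phrase_type="PP")]    # "for breaking the rules"
print([fe.name for fe in spans if is_candidate(fe)])          # ['Evaluee', 'Reason']
```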
3.2 Generating Semantically Consistent Spans

We generate semantically consistent FE spans for the selected candidate FEs via two approaches: finetuning a T5-large [15] model and prompting GPT-4 Turbo, following Mishra et al. [12]. In each case, we condition the generation on different degrees of semantic information:

No Conditioning: We generate FE spans without conditioning on any semantic labels.
FE-Conditioning: The generation is conditioned on the type of the FE span to be generated.
Frame+FE-Conditioning: The generation is conditioned on both the frame and the FE type.

T5-large Fine-Tuning. During the fine-tuning of T5-large, we incorporate semantic information using special tokens, as demonstrated in Table 3.2 through the example sentence in Figure 1.1. T5 models are fine-tuned on full-text and lexicographic data in FrameNet for 5 epochs with a learning rate of 1e-4 and an AdamW [11] optimizer with weight decay 0.01. The training process takes around 3 hours on 4 NVIDIA RTX A6000 GPUs.

Model | Input
No Conditioning | Growing up, <mask> are rewarded <mask>.
FE-Conditioning | Growing up, <FE: Evaluee> <mask> </FE: Evaluee> are rewarded <FE: Reason> <mask> </FE: Reason>.
Frame+FE-Conditioning | <Frame: Rewards_and_Punishments> Growing up, <FE: Evaluee> <mask> </FE: Evaluee> are rewarded <FE: Reason> <mask> </FE: Reason>.

Table 3.2: Template for finetuning T5 models on an example sentence.

GPT-4 Few-shot Prompting. When instructing GPT-4 models to generate FE spans, we provide the task title, definition, specific instructions, and examples of input/output pairs along with explanations for each output, as demonstrated in Table 3.3.

Title: Sentence completion using frame elements
Definition: You need to complete the given sentence containing one or multiple blanks (<mask>). Your answer must be of the frame element type specified in FE Type.
Example Input: Frame: Rewards_and_Punishments. Lexical Unit: discipline.v. Sentence: Growing up, <mask> are disciplined <mask>. FE Type: Evaluee, Reason.
Example Output: boys, for breaking the rules
Reason: The frame "Rewards_and_Punishments" is associated with frame elements "Evaluee" and "Reason". The answer "boys" fills up the first blank because it is a frame element (FE) of type "Evaluee". The answer "for breaking the rules" fills up the second blank because it is an FE of type "Reason".
Prompt: Fill in the blanks in the sentence based on the provided frame, lexical unit and FE type. Generate the spans that fill up the blanks ONLY. Do NOT generate the whole sentence or existing parts of the sentence. Separate the generated spans of different blanks by a comma. Generate the output of the task instance ONLY. Do NOT include existing words or phrases before or after the blank.
Task Input: Frame: Experiencer_obj. Lexical Unit: please.v. Sentence: This way <mask> are never pleased <mask>. FE Type: Experiencer, Stimulus.
Task Output:

Table 3.3: Example prompts for GPT-4 models. Texts in green only appear in FE-Conditioning and Frame+FE-Conditioning models. Texts in orange only appear in Frame+FE-Conditioning models.

The above process produces new sentences with generated FE spans which align better with the target LU, thereby preserving the original frame-semantic structure. However, despite the vastly improved generative capabilities of language models, they are still prone to making errors, and thus do not guarantee the semantic consistency we aim for. Hence, we adopt an overgenerate-and-filter approach [7, 20]: generate multiple candidates and aggressively filter out those that are semantically inconsistent.
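For illustration, the conditioned inputs of Table 3.2 could be assembled as in the sketch below. The build_input helper, its span-level representation, and the treatment of the frame name as a prefix token are assumptions of this illustration; the special tokens would be registered with the tokenizer before fine-tuning.

```python
# A sketch of assembling T5 inputs under the three conditioning settings of
# Table 3.2.
def build_input(spans, conditioning="frame+fe", frame=None):
    """spans: list of (text, fe_type) pairs, where fe_type is None for text
    kept verbatim and an FE name for a candidate span to mask and regenerate."""
    pieces = []
    for text, fe_type in spans:
        if fe_type is None:
            pieces.append(text)
        elif conditioning == "none":
            pieces.append("<mask>")
        else:  # "fe" and "frame+fe" wrap the mask in FE-type tokens
            pieces.append(f"<FE: {fe_type}> <mask> </FE: {fe_type}>")
    sentence = " ".join(pieces)
    if conditioning == "frame+fe":
        sentence = f"<Frame: {frame}> {sentence}"  # assumed frame-token format
    return sentence

example = [("Growing up,", None), ("boys", "Evaluee"),
           ("are rewarded", None), ("for breaking the rules", "Reason"),
           (".", None)]
for mode in ("none", "fe", "frame+fe"):
    print(build_input(example, mode, frame="Rewards_and_Punishments"))
```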
3.3 Filtering Inconsistent Generations

We design a filter to ensure that the generated sentences preserve the same semantics as the expert annotations of the original sentence. This requires the new FE spans to maintain the same FE type as the original. To this end, we train an FE type classifier on FrameNet by finetuning SpanBERT [5], a state-of-the-art model for span classification. Our classifier operates on the principle of classifying one FE span at a time. In cases where multiple FE spans are present within a single sentence, we split these into distinct instances for individual processing. For each instance, we introduce special tokens <LU_START> and <LU_END> around the LU, and <FE_START> and <FE_END> around the FE span. Additionally, the name of the evoked frame is appended to the end of the sentence. To train our classifier to effectively discern valid FE spans from invalid ones, we augment the training data with instances where randomly selected word spans are labeled as "Not an FE", constituting approximately 10% of the training data. The FE classifier is fine-tuned on full-text and lexicographic data for 20 epochs with a learning rate of 2e-5 and an AdamW optimizer with weight decay 0.01. The training process takes around 4 hours on 4 NVIDIA RTX A6000 GPUs. Our resulting FE classifier attains 95% accuracy when trained and tested on the standard FrameNet 1.7 splits.

Table 3.4 shows the distribution of the training, development, and test datasets following standard splits on FrameNet 1.7 from prior work [6, 19, 14, 22]. Both the development and test datasets consist exclusively of full-text data, whereas any lexicographic data, when utilized, is solely included within the training dataset. Since our generation approach is designed to produce lexicographic instances annotated for a single LU, when augmenting full-text data (§4), we break down each full-text example by annotated LUs and process them individually as multiple lexicographic examples.

Dataset Split             Size
Train (full-text + lex.)  192,364
Train (full-text)         19,437
Development               2,272
Test                      6,462

Table 3.4: Training set size with and without lexicographic data, development set size, and test set size in FrameNet 1.7.

We propose a new metric, FE fidelity, which measures how often the generated spans preserve the original FE types, as judged by our FE classifier. We use a strict filtering criterion: we remove all generations where our classifier detects even a single FE type inconsistency, i.e., we only retain instances with perfect FE fidelity.
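Concretely, the classifier input for one FE span and the perfect-fidelity filter can be sketched as follows; classify_fe stands in for the fine-tuned SpanBERT classifier, and the offset-based interface is an assumption of this illustration.

```python
# A sketch of the filtering step: mark one FE span and the LU with the special
# tokens described above, classify the span, and keep a generation only if
# every span's predicted FE type matches the original (perfect FE fidelity).
from typing import Callable

def mark_span(tokens, lu_idx, fe_span, frame):
    """Wrap the LU token and one FE span (inclusive token offsets) in special
    tokens, and append the evoked frame's name to the sentence."""
    start, end = fe_span
    out = []
    for i, tok in enumerate(tokens):
        if i == start:
            out.append("<FE_START>")
        if i == lu_idx:
            out.append("<LU_START>")
        out.append(tok)
        if i == lu_idx:
            out.append("<LU_END>")
        if i == end:
            out.append("<FE_END>")
    return " ".join(out) + " " + frame

def keep_generation(tokens, lu_idx, spans, frame,
                    classify_fe: Callable[[str], str]) -> bool:
    """spans: list of ((start, end), original_fe_type), one instance per span."""
    return all(classify_fe(mark_span(tokens, lu_idx, offsets, frame)) == fe_type
               for offsets, fe_type in spans)

# Demo with a stub classifier that always predicts "Reason":
tokens = "Growing up , boys are rewarded for breaking the rules .".split()
print(keep_generation(tokens, lu_idx=5, spans=[((6, 9), "Reason")],
                      frame="Rewards_and_punishments",
                      classify_fe=lambda s: "Reason"))  # True -> retained
```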
3.4 Intrinsic Evaluation of Generations

We evaluate our generated frame-semantic annotations by comparing them against those from Pancholy, Petruck, and Swayamdipta [13], before and after filtering (§3.3). We consider three metrics: perplexity under Llama-2-7B, for overall fluency and naturalness; FE fidelity; and human acceptance. We randomly sampled 1,000 LUs without annotations and used our generation framework to generate one instance for each of these LUs. For human acceptability, we perform fine-grained manual evaluation on 200 examples sampled from the generated instances.* We deem an example acceptable if the FE spans semantically align with the target LU and preserve the FE role definitions under FrameNet. Table 3.5 demonstrates the manual evaluation process on 10 selected examples.

*Human evaluation is mainly conducted by the author, complemented by two independent volunteers, unfamiliar with the generated data, evaluating the same examples from GPT-4 | Frame+FE. The resulting human acceptance rate differs only by around 1%, suggesting a consistent validation of generation quality across different observers.
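Perplexity here is the exponentiated mean negative log-likelihood of a sentence under the scoring model. A minimal sketch using the Hugging Face transformers library follows; it assumes access to the gated meta-llama/Llama-2-7b-hf checkpoint (and the accelerate package for device_map="auto"), and any causal LM checkpoint can be substituted.

```python
# A minimal sketch of the fluency metric: sentence perplexity under Llama-2-7B,
# computed as exp of the mean next-token negative log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids.to(lm.device)
    # With labels=input_ids, the model returns the mean cross-entropy of the
    # shifted next-token predictions; exponentiating yields perplexity.
    return torch.exp(lm(ids, labels=ids).loss).item()

print(perplexity("Growing up, boys are rewarded for breaking the rules."))
```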
Frame | LU (replacement) | Sentence | Original FEs | Generated FEs | Human Eval.
Verification | verify.v (confirm.v) | The bank, upon confirming <Unconfirmed_content>, released the goods to the customer. | compliance with the terms of the credit | the transaction details | ✓ ✓
Distributed_position | blanket.v (line.v) | <Theme> lines <Location> and the lake is covered with ice. | snow many feet deep, the land | the first snowfall, the shore | ✓ ✓
Being_located | sit.v (stand.v) | Against the left-hand wall nearest to the camera are three storage shelves; <Theme> stands <Location>. | a lidless unvarnished coffin in the process of construction, on the middle shelf | a tall vase, on the top shelf | ✓ ✓
Evoking | conjure.v (evoke.v) | A name like Pauline Gascoyne inevitably evoke <Phenomenon>. | an image of a bimbo Gazza in a GTi | memories of a bygone era | ✓ ✓
Event | happen.v (take place.v) | Jamaicans appear to worry little about the future; sometimes it seems that they worry little even about what takes place. | in the next few minutes | tomorrow | ✓ ✓
Self_motion | climb.v (walk.v) | My mother parked her bicycle in the shoulder and took my hand, and we walked <Goal>. | to the top of the hill | to the park | ✓ ✓
Process_materials | stain.v (process.v) | If you accidentally process <Material> <Alterant>, leave it for a week or two. | walls, with woodworm fluid | the wood, too much | ✓ ×
Self_motion | creep.v (make.v) | Matilda took the knife she had been eating with, and all four of them make. | towards the dining-room door | their way to the living room | ✓ ×
Hunting | hunt.v (fish.v) | <Food> too were mercilessly fished and often left, plucked and dying, where the sealers found them. | The albatrosses | The penguins | × ✓
Change_position_on_a_scale | dip.v (rise.v) | <Attribute> rose <Final_value> in the summer, but has recently climbed above $400 and last night was nudging $410. | The price per ounce, below $360 | The price, to $410 | × ✓

Table 3.5: Example generations of GPT-4 | FE, our best model according to human acceptance. The two marks in human evaluation represent whether the generations satisfy the two criteria individually: (1) sentence-level semantic coherence and (2) preservation of all FE types. A sentence is deemed acceptable only when it satisfies both criteria. The new replacement LUs are presented in orange or parentheses. Masked FEs are presented in teal and their corresponding FE types in angle brackets.

3.4.1 Main Results

Table 3.6 summarizes our main results. Our filtering approach, designed for perfect FE fidelity, improves performance under the other two metrics. Compared to the rule-based generations of Pancholy, Petruck, and Swayamdipta [13], our filtered generations fare better under both perplexity and human acceptability, indicating improved fluency and semantic consistency. Most importantly, models incorporating semantic information, i.e., the FE-conditioned and Frame+FE-conditioned models, achieve higher human acceptance and generally lower perplexity compared to their no-conditioning counterparts, signifying that semantic cues improve both fluency and semantic consistency. Even before filtering, FE fidelity increases with the amount of semantic conditioning, indicating the benefits of structure-based conditioning.

Model               Before Filtering (|Dtest| = 1K)                After Filtering (FE Fid. = 1.0)
                    FE Fid.   ppl.    Human (|Dtest| = 200)        ppl. (|Dtest|)   Human (|Dtest|)
Human (FN 1.7)      0.979     78.1    1.000                        97.0 (975)       1.000 (199)
Pancholy et al.     0.953     127.8   0.611                        146.0 (947)      0.686 (189)
T5                  0.784     139.3   0.594                        117.5 (789)      0.713 (156)
T5 | FE             0.862     127.6   0.711                        112.7 (850)      0.777 (168)
T5 | Frame+FE       0.882     136.8   0.644                        124.4 (873)      0.704 (172)
GPT-4               0.704     114.9   0.528                        114.2 (724)      0.723 (132)
GPT-4 | FE          0.841     106.3   0.700                        103.4 (838)      0.826 (164)
GPT-4 | Frame+FE    0.853     117.2   0.733                        111.8 (845)      0.821 (165)

Table 3.6: Perplexity, FE fidelity and human acceptability of T5 and GPT-4 generations conditioned on different degrees of semantic information. Numbers of instances after filtering are in parentheses. Best results are in boldface.

3.4.2 Reference-based Evaluation

We evaluate the quality of generated sentences using reference-based metrics, namely ROUGE [9] and BARTScore [21], by performing the candidate selection (§3.1) and generation (§3.2) steps on the test split of FrameNet 1.7 with verb LUs. As observed in Table 3.7, the T5 | FE model surpasses the others in ROUGE scores, signifying superior word-level precision, while GPT-4 achieves the highest BARTScore, indicating that its generated sentences most closely match the gold-standard FE spans in terms of meaning. For the reference-free metrics, GPT-4 | FE performs well in both perplexity and FE fidelity, showcasing its ability to produce the most fluent and semantically coherent generations.

                    BARTScore   ROUGE-1   ROUGE-L   Perp.     FE Fid.
Human               -           -         -         72.165    -
T5                  -5.939      0.301     0.298     447.874   0.829
T5 | FE             -5.922      0.318     0.316     434.231   0.840
T5 | Frame+FE       -6.179      0.276     0.274     441.639   0.843
GPT-4               -4.060      0.228     0.227     85.820    0.880
GPT-4 | FE          -4.336      0.218     0.217     82.977    0.930
GPT-4 | Frame+FE    -4.395      0.210     0.209     87.548    0.929

Table 3.7: Log BARTScore, ROUGE scores and perplexity of generations on the FrameNet test set without LU replacement.

Chapter 4: Augmenting Data for Frame-SRL

Beyond improving FrameNet coverage, we investigate the extrinsic utility of our generations as training data for the frame-SRL task, which involves identifying and classifying FE spans in sentences for a given frame-LU pair. Following Pancholy, Petruck, and Swayamdipta [13], we adopt a modified frame-SRL task, which assumes gold-standard frames and LUs. To explore state-of-the-art LLMs' capabilities on this task, we applied GPT-4 to directly perform frame-SRL parsing in a few-shot setting. Despite GPT-4's advanced language processing abilities, the model achieved a test F1 score of only 0.228. This result shows that, while GPT-4 is a powerful tool, it still struggles with the nuanced semantic understanding required for complex SRL tasks. Therefore, the need for specialized methodologies to enhance frame-SRL performance remains critical. To this end, we fine-tune a SpanBERT model on FrameNet's full-text data as our parser, avoiding existing parsers due to their complex problem formulation [10] or their need for extra frame and FE information [22].

As a pilot study, we prioritize augmenting the training data for verb LUs with F1 scores below 0.75 on average; this serves as an oracle augmenter targeting the lowest-performing LUs in the test set. For the generation of augmented data, we use our top-performing T5 and GPT-4 models according to human evaluation: T5 | FE and GPT-4 | Frame+FE. Of the 2,295 LUs present in the test data, 370 were selected for augmentation, resulting in 5,631 generated instances. After filtering, we retain 4,596 instances from GPT-4 | Frame+FE and 4,638 instances from T5 | FE. The sketch following this paragraph summarizes the augmentation loop.
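In the sketch below, generate_instances and passes_fe_filter are placeholders for the generation (§3.2) and filtering (§3.3) components, not actual interfaces from this work.

```python
# A sketch of the oracle augmentation loop: target LUs whose per-LU F1 in the
# unaugmented run falls below 0.75, overgenerate instances for them, filter
# for perfect FE fidelity, and extend the training set.
F1_THRESHOLD = 0.75

def augment_training_set(train_set, per_lu_f1,
                         generate_instances, passes_fe_filter):
    weak_lus = [lu for lu, f1 in per_lu_f1.items() if f1 < F1_THRESHOLD]
    augmented = list(train_set)
    for lu in weak_lus:
        # Keep only candidates whose generated spans preserve the original
        # FE types under the classifier-based filter.
        augmented.extend(c for c in generate_instances(lu)
                         if passes_fe_filter(c))
    return augmented
```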
                            All LUs F1       Aug. LUs F1
Unaugmented                 0.677 ± 0.004    0.681 ± 0.012
Aug. w/ T5 | FE             0.683 ± 0.000    0.682 ± 0.006
Aug. w/ GPT-4 | Frame+FE    0.684 ± 0.002    0.677 ± 0.010

Table 4.1: F1 score of all LUs and augmented LUs under the unaugmented setting and under settings augmented with generations from T5 | FE and GPT-4 | Frame+FE, averaged across 3 trials.

Table 4.1 shows the frame-SRL performance with and without data augmentation, on all LUs and on only the augmented LUs. Despite the successes with human acceptance and perplexity, our generations yield only marginal improvement on overall performance, and even hurt performance on the augmented LUs. We hypothesize that this stagnation in performance stems from two factors: (1) the phenomenon of diminishing returns experienced by our frame-SRL parser, elaborated in §4.1, and (2) the limited diversity of the augmented data. Apart from the newly generated FE spans, the generated sentences closely resemble the originals, and are thereby unable to introduce novel signals for frame-SRL. We speculate that Pancholy, Petruck, and Swayamdipta [13] are successful at data augmentation despite using only sister LU replacement perhaps because they use a weaker parser [19], which leaves more room for improvement compared to ours.

4.1 Experiments on Diminishing Returns

To further investigate our failure to improve frame-SRL performance via data augmentation, we conduct a pilot using original FrameNet data for augmentation under our SpanBERT model. We conduct experiments using increasing proportions of FrameNet training data under three settings: (1) training our SRL parser with full-text data, (2) training our SRL parser with both full-text and lexicographic data (which contains 10x more instances), and (3) training an existing frame semantic parser [10]* with full-text data, to control for the use of our specific parser.

*Lin, Sun, and Zhang [10] break frame-SRL into three subsequent sub-tasks: target identification, frame identification, and SRL, contributing to worse overall performance.

[Figure 4.1 plot: F1 score against the percentage of training data used, with curves for full-text + lexicographic data, full-text data, and Lin et al.'s parser on full-text data, plus markers for Lin et al.'s parser on SRL and for 25% full-text data + 6.25% augmentation.]

Figure 4.1: Learning curves for our frame-SRL model and Lin, Sun, and Zhang [10]'s end-to-end parser show diminishing returns on adding more human-annotated training data. The triangle marker denotes the performance of Lin, Sun, and Zhang [10]'s parser on SRL with gold frame and LU, as reported in their study.

Figure 4.1 shows that parsers in all three settings exhibit diminishing returns, especially in the second setting, which utilizes the largest training set. This indicates that there is little room left for improvement in frame-SRL, even with more human-annotated data. To further evaluate the quality and utility of our generations without the influence of diminishing returns, we perform data augmentation in a low-data environment, as indicated by the learning curve. Specifically, we augment 25% of the full-text training data with an additional 6.25% of data generated using our method. As demonstrated in Figure 4.1, the performance of the model in this scenario not only exceeds that of the 25% dataset without augmentation, but also exceeds that of the 25% dataset augmented with 6.25% of human-annotated data. This result showcases the high quality of our generated data and the effectiveness of targeted data augmentation in low-resource settings. A sketch of this learning-curve setup follows.
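```python
# A sketch of the learning-curve protocol in this section: train the same
# parser on nested, increasingly large subsets of the training data and
# record test F1. The sampling seed and the train_and_eval interface are
# assumptions of this sketch.
import random

FRACTIONS = [0.05, 0.10, 0.25, 0.50, 0.75, 1.00]

def learning_curve(train_set, train_and_eval, seed=0):
    shuffled = train_set[:]
    random.Random(seed).shuffle(shuffled)
    # Nested subsets: each larger fraction contains all smaller ones.
    return {frac: train_and_eval(shuffled[: int(frac * len(shuffled))])
            for frac in FRACTIONS}
```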
4.2 Experiments on Verb LU Instances

Since our generation method mainly focuses on augmenting verb LUs, we conduct augmentation experiments using a subset of FrameNet that includes only verb LU instances. To ensure adequate model performance when training on a subset of the data, we incorporate lexicographic data with verb LUs into our training set, resulting in a training set of 80.2k examples, a development set of approximately 600 examples, and a test set of about 2k examples. We experimented with different augmentation percentages, both with and without filtering, as shown in Table 4.2. We use an oracle augmenter to augment LUs inversely proportionally to their F1 scores from the unaugmented experiments. To expand coverage over more LUs during augmentation, we augment all LUs rather than limiting augmentation to those with F1 scores below 0.75. Although the improvements are marginal, the outcomes from filtered augmentations are generally better than those from their unfiltered counterparts.

                       All LUs F1   Aug. LUs F1
Unaugmented            0.751        0.779
5% Aug. w/o filter     0.745        0.778
5% Aug. w/ filter      0.752        0.781
25% Aug. w/o filter    0.752        0.776
25% Aug. w/ filter     0.753        0.781

Table 4.2: F1 score of all verb LUs and augmented LUs in augmentation experiments using different percentages of augmentations generated by T5 | FE, with and without filtering, compared to baseline results without data augmentation. Best results are in boldface.

                        All LUs F1   Aug. LUs F1
25% gold                0.751        0.779
25% gold + 6.25% Aug.   0.745        0.778
5% Aug. w/ filter       0.752        0.781
25% Aug. w/o filter     0.752        0.776
25% Aug. w/ filter      0.753        0.781

Table 4.3: F1 score of all verb LUs and augmented LUs in augmentation experiments using different percentages of augmentations generated by T5 | FE, with and without filtering, compared to baseline results without data augmentation. Best results are in boldface.

Chapter 5: Conclusion and Discussion

Our study provides insights into the successes and failures of LLMs in manipulating FrameNet's linguistic structures. When conditioned on semantic information, LLMs show improved capability in producing semantically annotated sentences, indicating the value of linguistic structure in language generation. Nevertheless, despite this success, augmenting FrameNet does not lead to performance gains on the downstream frame-SRL task, echoing challenges reported in applying LLMs to other flavors of semantics [1, 8, 2]. These outcomes underline the need for further exploration into how LLMs can be more effectively employed in automating linguistic structure annotation.

We look forward to furthering our research to better address the lack of generation diversity in our method, which potentially causes the stagnation in data augmentation performance. One simple yet promising approach is to generate multiple candidates for each LU-instance pair. This method would allow models to explore a wider array of semantic interpretations under the same linguistic structure, and would ensure the coverage and variety of generated instances after filtering. Another approach to improving generation diversity would be to perform direct sentence generation using GPT-4, guided by frame, LU, and FE names, and then apply an existing SRL parser to annotate these sentences, providing silver-standard labels. This approach represents a trade-off between generation diversity and annotation quality, and could lead to improvements in the models' handling of the frame-SRL task.

Building on these efforts, my PhD research will continue to leverage structures in data to improve the understanding and capabilities of LLMs, tackling some of the persistent challenges in NLP. For instance, a significant challenge in the performance of LLMs is their tendency towards hallucinations: generations unsupported by factual information in the training data. To combat hallucinations, we propose to incorporate frame semantics into the attribution process, linking the generations of existing LLMs to supporting evidence in trusted documents retrieved from the web.
Subsequently, we edit the generations to remove factual inconsistencies with the retrieved sources of knowledge, offering a controlled and granular solution to enhance the reliability of LLMs.

Bibliography

[1] Xuefeng Bai, Jialong Wu, Yulong Chen, Zhongqing Wang, and Yue Zhang. "Constituency Parsing using LLMs". In: ArXiv abs/2310.19462 (2023). url: https://api.semanticscholar.org/CorpusID:264825351.
[2] Allyson Ettinger, Jena D. Hwang, Valentina Pyatkin, Chandra Bhagavatula, and Yejin Choi. ""You Are An Expert Linguistic Annotator": Limits of LLMs as Analyzers of Abstract Meaning Representation". In: Conference on Empirical Methods in Natural Language Processing. 2023. url: https://api.semanticscholar.org/CorpusID:264555617.
[3] Charles J. Fillmore. "Frames and the semantics of understanding". In: Quaderni di Semantica 6.2 (1985), pp. 222-254.
[4] Daniel Gildea and Dan Jurafsky. "Automatic Labeling of Semantic Roles". In: Annual Meeting of the Association for Computational Linguistics. 2000. url: https://api.semanticscholar.org/CorpusID:62182406.
[5] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. "SpanBERT: Improving Pre-training by Representing and Predicting Spans". In: Transactions of the Association for Computational Linguistics 8 (2019), pp. 64-77. url: https://api.semanticscholar.org/CorpusID:198229624.
[6] Meghana Kshirsagar, Sam Thomson, Nathan Schneider, Jaime G. Carbonell, Noah A. Smith, and Chris Dyer. "Frame-Semantic Role Labeling with Heterogeneous Annotations". In: Annual Meeting of the Association for Computational Linguistics. 2015. url: https://api.semanticscholar.org/CorpusID:11727829.
[7] Irene Langkilde-Geary and Kevin Knight. "Generation that Exploits Corpus-Based Statistical Knowledge". In: Annual Meeting of the Association for Computational Linguistics. 1998. url: https://api.semanticscholar.org/CorpusID:2680971.
[8] Boda Lin, Xinyi Zhou, Binghao Tang, Xiaocheng Gong, and Si Li. "ChatGPT is a Potential Zero-Shot Dependency Parser". In: ArXiv abs/2310.16654 (2023). url: https://api.semanticscholar.org/CorpusID:264451618.
[9] Chin-Yew Lin. "ROUGE: A Package for Automatic Evaluation of Summaries". In: Annual Meeting of the Association for Computational Linguistics. 2004. url: https://api.semanticscholar.org/CorpusID:964287.
[10] Zhichao Lin, Yueheng Sun, and Meishan Zhang. "A Graph-Based Neural Model for End-to-End Frame Semantic Parsing". In: Conference on Empirical Methods in Natural Language Processing. 2021. url: https://api.semanticscholar.org/CorpusID:237941027.
[11] Ilya Loshchilov and Frank Hutter. "Decoupled Weight Decay Regularization". In: International Conference on Learning Representations. 2017. url: https://api.semanticscholar.org/CorpusID:53592270.
[12] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. "Cross-Task Generalization via Natural Language Crowdsourcing Instructions". In: Annual Meeting of the Association for Computational Linguistics. 2021. url: https://api.semanticscholar.org/CorpusID:237421373.
[13] Ayush Pancholy, Miriam R. L. Petruck, and Swabha Swayamdipta. "Sister Help: Data Augmentation for Frame-Semantic Role Labeling". In: ArXiv abs/2109.07725 (2021). url: https://api.semanticscholar.org/CorpusID:237532273.
[14] Hao Peng, Sam Thomson, Swabha Swayamdipta, and Noah A. Smith. "Learning Joint Semantic Parsers from Disjoint Data". In: ArXiv abs/1804.05990 (2018). url: https://api.semanticscholar.org/CorpusID:4956705.
[15] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". In: ArXiv abs/1910.10683 (2019). url: https://api.semanticscholar.org/CorpusID:204838007.
[16] Alexis Ross, Tongshuang Sherry Wu, Hao Peng, Matthew E. Peters, and Matt Gardner. "Tailor: Generating and Perturbing Text with Semantic Controls". In: Annual Meeting of the Association for Computational Linguistics. 2021. url: https://api.semanticscholar.org/CorpusID:235898904.
[17] Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, Collin F. Baker, and Jan Scheffczyk. FrameNet II: Extended Theory and Practice. ICSI: Berkeley, 2016.
[18] Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. "FrameNet II: Extended theory and practice". In: 2006. url: https://api.semanticscholar.org/CorpusID:62163005.
[19] Swabha Swayamdipta, Sam Thomson, Chris Dyer, and Noah A. Smith. "Frame-Semantic Parsing with Softmax-Margin Segmental RNNs and a Syntactic Scaffold". In: ArXiv abs/1706.09528 (2017). url: https://api.semanticscholar.org/CorpusID:3170713.
[20] Marilyn A. Walker, Owen Rambow, and Monica Rogati. "SPoT: A Trainable Sentence Planner". In: North American Chapter of the Association for Computational Linguistics. 2001. url: https://api.semanticscholar.org/CorpusID:7608649.
[21] Weizhe Yuan, Graham Neubig, and Pengfei Liu. "BARTScore: Evaluating Generated Text as Text Generation". In: ArXiv abs/2106.11520 (2021). url: https://api.semanticscholar.org/CorpusID:235593404.
[22] Ce Zheng, Yiming Wang, and Baobao Chang. "Query Your Model with Definitions in FrameNet: An Effective Method for Frame Semantic Role Labeling". In: ArXiv abs/2212.02036 (2022). url: https://api.semanticscholar.org/CorpusID:254247282.