EFFICIENT LEARNING: EXPLORING COMPUTATIONAL AND DATA-DRIVEN TECHNIQUES FOR EFFICIENT TRAINING OF DEEP LEARNING MODELS by Hassan Hamad A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) December 2024 Copyright 2025 Hassan Hamad Acknowledgements I would like to first and foremost thank my amazing supervisor Professor Keith Chugg. This work would not be possible without his constant support, guidance and readiness to help. Most importantly, he taught me how to think and conduct my work as a researcher. It’s been a very rewarding experience working alongside him over the years. I’d also like to thank my fantastic collaborators across all my projects—working together has been both productive and a lot of fun. Your insights and dedication have been invaluable to this research. A big thank you as well to my committee members for their time, guidance, and feedback throughout this process. I also want to express my appreciation to USC for awarding me the Annenberg fellowship upon my arrival. This provided the freedom I needed to find my path and focus on the areas I was most passionate about. Additionally, my heartfelt thanks to all the exceptional faculty and staff at USC, who have always been available for advice and support. I’m also truly grateful to my industry sponsor, V-Labs, for their financial support and for the opportunity to work with a great team over the years. Finally, I would like to express my deepest gratitude to my family, especially to my wife Meriem, whose unwavering support, patience, and love have been my foundation throughout this journey. ii Table of Contents Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Chapter 1: Introduction and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Two different Regimes: Compute-Bound and Data-Bound . . . . . . . . . . . . . . . 1 1.2 The Problem of Expensive Compute . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 The Problem of Limited Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Chapter 2: Reducing Training Computational Complexity . . . . . . . . . . . . . . . . . . . . 7 2.1 Node Processing and Max-Sum Networks . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Approximation Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Training with Logarithmic Number System (LNS) . . . . . . . . . . . . . . . . . . . . 14 2.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.2 LNS Definition and Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.3 QAA-LNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.2.5 Hardware Study . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . 31 2.2.6 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Chapter 3: Techniques for Training in Low-Resource Data Settings in NLP . . . . . . . . . . 34 3.1 Two Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.1 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.2 LLM Tool-Use (Function-Calling) . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2 FIRE and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.1 Introduction and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.2 FIRE Dataset License and Intended Use . . . . . . . . . . . . . . . . . . . . . 44 3.2.3 Data Splits and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.4 Data Collection and Annotation . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.5 Labeling Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 iii 3.3 Preliminary Work: Synthetic Data for NER . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.1 Synthetic Data and Knowledge Distillation . . . . . . . . . . . . . . . . . . . 54 3.3.2 The Partial Annotation Problem . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.3 Mask the Unknown Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4 ToolCritic: Improving LLM Tool-Use in Multi-Turn Dialogue Applications . . . . . . 61 3.4.1 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.4.2 Dataset Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.4.3 ToolCritic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.4.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Chapter 4: Concluding Remarks and Suggested Future Work . . . . . . . . . . . . . . . . . . 76 4.1 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.2 Future Work: Synthetic Data for RE . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.2.2 Prompting using Entity Pair Combinations . . . . . . . . . . . . . . . . . . . 77 4.2.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Max-Sum Networks Theorem Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 A.1 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 A.2 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 A.3 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 A.4 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 FIRE Dataset Statistics and Experimental Details . . . . . . . . . . . . . . . . . . . . 
. . 112 B.1 Distribution of entity and relation types in FIRE . . . . . . . . . . . . . . . . . . . . 112 B.2 Hyper-parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 B.3 Llama 2-7b and GPT 3.5 Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 B.3.1 Few-Shot Learning Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 B.3.2 Fine-Tuning Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Appendix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 FIRE Dataset Annotation Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 C.1 Summary List of Named Entities and Relations . . . . . . . . . . . . . . . . . . . . . 119 C.2 Definition of Named Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 C.3 Definition of Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 C.4 Detailed Entity Annotation Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 C.5 Detailed Relation Annotation Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 C.6 General Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 Appendix D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 ToolCritic Prompt Templates and Training Details . . . . . . . . . . . . . . . . . . . . . . 180 D.1 Synthetic Data Generation Prompt and Details . . . . . . . . . . . . . . . . . . . . . 180 D.1.1 Demonstration Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 D.1.2 Hand-Writing Demonstration Examples . . . . . . . . . . . . . . . . . . . . . 184 iv D.1.3 Few-Shot Data Generation Prompt . . . . . . . . . . . . . . . . . . . . . . . . 187 D.2 Supervised Fine-Tuning Prompt and Training Details . . . . . . . . . . . . . . . . . . 190 D.2.1 Fine-Tuning Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 D.2.2 Fine-Tuning Prompt Template . . . . . . . . . . . . . . . . . . . . . . . . . . 192 D.3 Performance difference on Error Categories in Low-Label Regime . . . . . . . . . . . 193 D.4 Illustrative Examples of ToolCritic’s Generalization Ability . . . . . . . . . . . . . . 195 D.4.1 Scenario 1: Non-Invocation Confirmation Error . . . . . . . . . . . . . . . . . 196 D.4.2 Scenario 2: Premature Invocation Error . . . . . . . . . . . . . . . . . . . . . 197 v List of Tables 2.1 Different node operations studied in this work . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Comparison of linear fixed-point and log fixed-point at different bitwidths. X-YF bitwidth representation refers to an X bitwidth number with Y dedicated fractional bits. For the log representation, this is inclusive of the extra real sign bit. ULP is the Unit of Least Precision. In this context, it refers to the smallest positive representable value. Note that for linear fixed-point, since resolution is fixed, ULP and resolution are equivalent. The calculations for the log format assume a logarithm base 2. . . . . 15 2.3 List of network modifications and extra operations used in fixed-point training literature. Note that our work does not employ any of these techniques which complicate the training pipeline and/or add additional hardware and computation overhead. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
15 2.4 Comparing QAA-LNS with recent Logarithmic Number System (LNS)-based designs. This Table is adopted from [141]. We extend the table with the last three rows to highlight the difference between our design and previous ones. ‘Quantization-Aware Arithmetic’ refers to optimizing the arithmetic to each unique bitwidth. The ‘Drop-in Replacement’ feature is the ability to use the LNS design in-place of floating-point in all network operations while maintaining the same network and hyper-parameters, and achieving comparable accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5 Median of validation accuracy for the last 10 training epochs of 32-bit floatingpoint (baseline) vs fixed-point LNS using the VGG Models. The LNS bitwidth representation, T-bit (F, o) emphasizes that LNS arithmetic units such as addition and multiplication act on T-bit integers with F dedicated fractional bits. o is the number of overhead flags used in the representation. . . . . . . . . . . . . . . . . . . 27 2.6 Median of validation accuracy for the last 10 training epochs of 32-bit floating-point (baseline) vs fixed-point LNS using the ResNet-18 models. . . . . . . . . . . . . . . . 27 2.7 Standard LNS vs QAA-LNS. Median of validation accuracy for the last 10 training epochs of 32-bit floating-point (baseline) vs fixed-point QAA-LNS and standard LNS using the ResNet-18 model on TinyImageNet. . . . . . . . . . . . . . . . . . . . . . . 29 vi 2.8 Area and power consumption for a MAC based on different number systems. The bitwidth listed below refers to the total number of bits used by any number representation. For QAA-LNS, this is T + o where T is the number of arithmetic bits (to represent the log magnitude) and o = 2 is the number of overhead flag-bits for the sign and zero-flag. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.9 Ablation Studies on CIFAR-100 and TinyImageNet . . . . . . . . . . . . . . . . . . . 32 3.1 Comparison of FinRED, KPI-EDGAR, and FInancial Relation Extraction (FIRE) datasets. FIRE has the advantage over FinRED in that it is hand-annotated and over KPI-EDGAR in that it is larger, has diverse relations and is more comprehensive in terms of covering financial aspects over a business. Note that FinRED statistics for entity and relation mentions were not readily available. The figures included below were manually computed after a review of the FinRED data files. . . . . . . . . . . . 41 3.2 Inter-annotator micro F1 scores. Annotators A and B are engineers familiar with the NER/RE task. Annotator C had no prior familiarity with the NER/RE task nor any expertise in engineering, finance, or linguistics. . . . . . . . . . . . . . . . . . . . . . 45 3.3 Performance of all models on the FIRE test data. Mean and standard deviation (in superscript) are reported for micro F1 score for both entities and relations. SpERT, PL-Marker, and REBEL are evaluated in two settings: Standard Fine-Tuning and Curriculum Learning. Llama 2-7b and GPT 3.5 are evaluated in a few-shot setting as well as in a standard fine-tuning setting. Note that the REBEL model does not compute entity metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4 NER Performance on CoNLL04 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.5 NER Performance on SciERC Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.6 Main results of the experimental evaluation on the SGD test data. Each experiment setup is repeated twice with random seeds. 
The mean and standard deviation (in superscript) are reported for each metric. . . . . . . . . . . . . . . . . . . . . . . . . 70 4.1 Evaluation results of Relation Extraction (RE) on the CONLL04 dataset. Reported values are mean and standard deviation (in superscript) of micro F1 score. . . . . . . 83 4.2 Evaluation results of RE on the SciERC dataset. Reported values are mean and standard deviation (in superscript) of micro F1 score. . . . . . . . . . . . . . . . . . . 83 B.1 FIRE Dataset Entity Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 B.2 FIRE Dataset Relation Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 B.3 Selected hyper-parameters for standard fine-tuning. Note that PL-Marker has a separate training run for its NER module. Therefore, we search for the learning rate and batch size of this module as well. . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 vii B.4 Hyper-parameters for curriculum learning experiments. Note that for PL-Marker, we apply curriculum learning on the RE module only. For the NER module, we fix the learning rate to 5e − 5 and the batch size to 4. . . . . . . . . . . . . . . . . . . . . . 113 viii List of Figures 1.1 Evolution in Model Complexity measured through Parameter Count from 2018 to 2023: This figure tracks the increase in the number of parameters of some of the notable state-of-the-art language model, in billions..∗ . . . . . . . . . . . . . . . . . . 2 1.2 Comparative Overview of Compute and Data Bound Regimes in Deep Learning: On the left, Compute Bound focuses on the challenges of processing vast datasets with very large models highlighting the need to minimize computational expense. On the right, Data Bound emphasizes the goal of achieving high accuracy with smaller, specialized datasets, addressing the complexity and cost of obtaining high-quality annotated data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Evolution of Training Compute in Deep Learning Models from 2018 to 2023: This graph illustrates the exponential increase in computational resources, measured in exaFLOPs (1018 floating-point operations), required to train some of the notable state-of-the-art language models over time.† . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 A d hidden layer neural network with scalar input x and scalar output fd(x). The node processing at each unit is donated by the function g(.). Note that g(.) also includes the activation function, when applicable. . . . . . . . . . . . . . . . . . . . . 9 2.2 Plot of the functions ∆+(x) and ∆−(x) with an example of a bit-shift approximation (in green) and a piece-wise linear approximation with power-of-two slopes (in blue). Note that the displayed piece-wise linear curves are intentionally a relatively poor approximation of ∆±(x) to illustrate the nature of the approximation errors. . . . . . 24 2.3 Plot of the validation accuracy of floating-point and LNS on TinyImageNet for multiple experiments. To reach the same level as floating-point, at least 14 ‘arithmetic’ bits are needed. 12-bit LNS suffers from severe degradation in accuracy. ‘Not QA‘ refers to using an approximation that is not ‘quantization-aware’. ‘11-bit QA’ refers to using a ‘quantization-aware’ approximation that was optimized for 11-bit LNS. . . 30 3.1 A labeled sentence from the FIRE dataset and an example of how a Knowledge Graph can be built using the collected labels. . . . . . . . . . . . . . . . . . . . . . . 
40 ix 3.2 Scatter plot of labeling time (in seconds) versus the number of relations in the sentence. The marginal distributions and histograms are displayed at the edges of the plot. For sentences with the same number of relations, there is a wide distribution of labeling times, showing how the two quantities are correlated but still provide different information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3 Stages of data collection: 1) Manually gather relevant sentences. 2) Hand-label them to create a “seed” dataset. 3) Train an RE-specialized model on this dataset. 4) Use the model on new financial content to identify entities and relations. 5) From the model’s output, select sentences with low-confidence predictions to reduce confirmation bias. Remove existing labels from these sentences, manually annotate them, and merge with prior data. Repeat until the desired dataset size is achieved. . 43 3.4 Comparison of F1 scores across each relation type in FIRE between standard fine-tuning and curriculum learning approaches using the SpERT model. The results highlight varying levels of difficulty in relation detection and may suggest an improvement in challenging relations when employing curriculum learning. . . . . . . 52 3.5 ToolCritic evaluates every assistant response in a conversation, whether a tool was called or not (this is not shown in the diagram for space limitations). If ToolCritic detects an assistant error, it will produce a reasoning “thought” which will be fed back to the assistant. The assistant then will revise its own response based on the received feedback and produce an updated response. If ToolCritic does not detect a mistake, the conversation continues without interference. . . . . . . . . . . . . . . . . 62 3.6 Each error category defines a specific mistake, or failure mode, that is common for LLMs when interacting with tools in multi-turn dialogue. This granularity ensures the resulting diagnostic model, ToolCritic, can provide very specific feedback for the assistant LLM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.7 Results of ToolCritic on the test data split of the synthetic dataset. Precision, Recall and ROUGE score (a) and confusion matrix on the test dataset when using all training data, i.e. 225 examples per error category (b) . . . . . . . . . . . . . . . . . 68 3.8 “Error Profile” of Claude 3 Sonnet when evaluated on the test data split of the SGD dataset. The diagram presents the distribution of mistakes detected by ToolCritic when Claude 3 Sonnet acts as the assistant LLM. . . . . . . . . . . . . . . . . . . . . 74 D.1 Histogram displaying the length (in tokens) of the input prompt and output response of the fine-tuning training dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 D.2 Confusion matrix on test dataset when training with 10% training data only . . . . 194 x Abstract In the rapidly evolving field of deep learning, the increasing complexity and scale of neural network models pose significant challenges in terms of computational demand and data requirements. As models grow in size, the computational costs associated with training them escalate, making compute efficiency a critical area of research. At the same time, these models are “hungry” for labeled data-resources that are often costly and time-consuming to procure. This is especially true in domain-specific settings, where data scarcity is pronounced and compute is not a concern. 
These two regimes: compute-bound training and data-bound training, necessitate innovative approaches to accomplish two separate goals: reducing the computational cost of training and training better models with minimal labeled data. The primary goal of this dissertation is to explore and develop novel techniques that enhance training efficiency under the two mentioned regimes. On the compute angle, dot-products computations (also referred to as multiply-accumulate) dominate the cost of training neural networks. Therefore, a fundamental question is first investigated: What type of computations are needed to train neural network models? Is there a more efficient alternative to dot-products? Based on the results of this investigation, an implementation of training neural network models, specifically Convolutional Neural Networks (CNNs), using fixed-point Logarithmic Number System (LNS) follows. The second part of this dissertation tackles the challenges of data scarcity in domain-specific natural language processing (NLP) tasks. We first explore curriculum learning-based strategies to train on limited xi labeled data through a novel difficulty criterion, labeling time. Next, building on the trend of large language models exhibiting enhanced reasoning abilities, this dissertation also investigates the use of synthetic data generation to expand the availability of labeled data. This approach is applied to address downstream NLP tasks such as information extraction and improving the accuracy of tool-use in large language models. xii List of Publications Published W. Chang, H. Hamad and K. M. Chugg, "Approximation Capabilities of Neural Networks using Morphological Perceptrons and Generalizations," 2022 56th Asilomar Conference on Signals, Systems, and Computers. Note: William Chang, an undergraduate USC student, led the work on this paper, with my guidance and support. H. Hamad, A. K. Thakur, N. Kolleri, S. Pulikodan, and K. M. Chugg, "FIRE: A Dataset for Financial Relation Extraction," 2024 Findings of the North American Chapter of the Association for Computational Linguistics (NAACL). Note: I collaborated closely with my co-authors from V-Labs on dataset design, curation, and analysis. Submitted Under Review H. Hamad, D. Qiu, P. A. Beerel and K. M. Chugg, "End-to-End Training of Neural Networks using Quantization Aware Approximate Log Number System (QAA-LNS)," Submitted to the 2025 International Joint Conferences on Artificial Intelligence (IJCAI). xiii Note: My co-authors contributed the technical hardware experimental analysis as well general feedback and support for the rest of the work. H. Hamad, Y. Xu, L. Zhao, and W. Yan, "ToolCritic: Evaluating and Improving the Tool Usage Ability of Large Language Models in Multi-Turn Dialogues," Submitted to the 2025 Association for Computational Linguistics (ACL) Note: Conducted during my internship at Amazon, with collaborators from Amazon providing feedback and high-level direction. xiv Chapter 1 Introduction and Background 1.1 Two different Regimes: Compute-Bound and Data-Bound Neural networks, particularly deep learning models, represent a cornerstone of modern artificial intelligence, powering advancements in various fields of science ranging from natural language processing [35, 92] to computer vision [67, 117, 49], among others [60, 12]. Despite their impressive capabilities, training these models presents considerable challenges. 
These models learn to perform tasks by training on large amounts of data which makes the process both data-intensive and computationally demanding, often requiring substantial hardware resources and time, particularly as models grow in complexity and size. Recent trends in deep learning have seen an exponential growth in model sizes, with state-of-the-art language models now being in the tens or even hundreds of billions of parameters [24, 14, 122, 121]. Figure 1.1 displays the exponential increase in parameter counts of language models over the years. This huge increase in model sizes is also accompanied by an exponential growth in the computational requirements needed to train these models. In addition, the advanced reasoning abilities of deep learning models prompted their use in different branches of science, usually following the now common pre-training then fine-tuning approach [39, 35]. The caveat here is that this requires ∗Dataset source: Epoch, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at epochai.org. Retrieved from: ‘https://epochai.org/data/epochdb/visualization’ [online resource] 1 Figure 1.1: Evolution in Model Complexity measured through Parameter Count from 2018 to 2023: This figure tracks the increase in the number of parameters of some of the notable state-of-the-art language model, in billions..∗ creating high quality domain-specific and task-specific datasets to fine-tune these models on. This process is usually expensive and logistically complex, stemming from the need to train a number of annotators on a new subject. We can thus define two different regimes based on the different challenges and goals: the computebound regime and the data-bound regime. Figure 1.2 illustrates the key point of each regime. The compute-bound regime is usually the case of pre-training large models such as Large Language Model (LLM)s while the data-bound regime is the case of fine-tuning these models to solve specific tasks in narrow domains. Given this view of the field, the goal of this dissertation is to investigate and develop additional techniques and algorithms that address the distinct challenges presented by training deep learning models in these two settings. In the following two sections, a more detailed view of the challenges, goals and current research trends in each of the two regimes is given. 2 Figure 1.2: Comparative Overview of Compute and Data Bound Regimes in Deep Learning: On the left, Compute Bound focuses on the challenges of processing vast datasets with very large models highlighting the need to minimize computational expense. On the right, Data Bound emphasizes the goal of achieving high accuracy with smaller, specialized datasets, addressing the complexity and cost of obtaining high-quality annotated data. 1.2 The Problem of Expensive Compute Recent years have shown the introduction of many “foundational” models, especially in the NLP field [24, 14, 122, 121]. Scaling laws for transformer language models, such as those exemplified by the Chinchilla model, describe how increases in model size, data volume, and computational resources systematically improve performance [52, 61]. With the increase in model sizes comes the increase in training cost. Figure 1.3 showcases this increase in terms of FLOP (Floating Point Operations) count. This in turns leads to increases in energy consumption and memory requirements [103, 56]. 
The process of pre-training requires huge amounts of data from all domains and is intended at building a powerful general purpose model, rather than a model targeted at a specific task. The main challenge here will be the cost of this training process. Given this setup, there is an obvious need to reduce the computational complexity of training. In response to this, there has been a strong push within the deep learning community to develop techniques aimed at reducing the cost of training. This effort spans several approaches, 3 each targeting different aspects of the training process. On the software side, techniques such as Flash-Attention, which optimizes the attention mechanism to reduce memory and compute overhead, have shown significant promise [28]. Similarly, algorithmic enhancements like quantization and mixed-precision training address the hardware utilization efficiency by allowing neural networks to operate with lower precision arithmetic, reducing the computational load without substantially impacting model accuracy [114, 7]. Further, sparsity techniques, which involve pruning insignificant weights or activating only a subset of neurons during forward and backward passes, effectively decrease the number of operations required during training, leading to faster and less resourceintensive computations [37]. These advancements reflect a broader trend towards more sustainable and scalable AI practices. Figure 1.3: Evolution of Training Compute in Deep Learning Models from 2018 to 2023: This graph illustrates the exponential increase in computational resources, measured in exaFLOPs (1018 floating-point operations), required to train some of the notable state-of-the-art language models over time.† †Dataset source: Epoch, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at epochai.org. Retrieved from: ‘https://epochai.org/data/epochdb/visualization’ [online resource] 4 1.3 The Problem of Limited Data Unlike the compute-bound regime, the data-bound regime can be generally attributed to the “finetuning” phase, i.e. assuming a general purpose pre-trained model is already available. In this case, one is interested in using this model to solve a particular task, usually in a specific domain as well. For example, one might want to obtain a good model for extracting relevant information from biomedical literature, such as symptoms, viruses, and relations between them [131, 21]. In this case, a high quality labeled dataset is needed to fine-tune some general purpose model. These datasets are usually much smaller in size which results in the bottleneck being not computational resources but the availability and quality of training data. Labeling domain-specific datasets can be a very logistically complex and expensive process. This usually involves hiring annotators, training them with some reference data, computing their agreement scores and then filtering out any bad annotations. Given this challenge, an important direction would be to study how to best utilize the limited labeled data available such as to maximize the models performance on the target task. To address this, the research community has been developing novel training techniques that maximize learning from limited data. Examples of such techniques include curriculum learning [9, 89, 139], where training examples are not randomly presented but organized in an easier to gradually more complex sequence mimicking a learning progression that can lead to more effective learning. 
Additionally, semi-supervised learning exploits the, usually larger, available unlabeled data in many domains [40, 134]. By utilizing a small amount of labeled data alongside a larger pool of unlabeled data, semi-supervised techniques significantly expand the effective size of training datasets, enhancing model performance without the need for extensive labeling. A more recent direction is using synthetic data generated by LLMs to train smaller models [75, 77]. Due to the rise of generative models that acquire high reasoning abilities and can achieve human-level performance on many tasks, human-like artificial data can now be generated with minimal cost. This practice can be seen as a form of 5 indirect knowledge distillation [51, 44], i.e. transferring the knowledge for the LLM to a smaller, more compact model. 6 Chapter 2 Reducing Training Computational Complexity 2.1 Node Processing and Max-Sum Networks 2.1.1 Background Deep neural networks are driving the AI revolution. They have led to breakthroughs in various fields ranging from computer vision [67] and natural language processing [35] to protein folding [60] and autonomous driving [12]. The current trend is toward larger models. This has motivated recent work on complexity reductions for both inference and training. Complexity reduction techniques such as pruning [48], sparsity [69], and quantization [65] have been proposed. This is particularly important for models deployed on edge devices that have limited memory and computational resources. Another approach for complexity reduction is to depart from the Multiply Accumulate (MAC) (or sum-product) processing used in standard Artificial Neural Networks (ANNs). Radically different network structures, such as the Spiking Neural Network (SNN) [102, 18], have been proposed. Others have proposed to simply replace sum-product operations with a different, and ideally more efficient, operations [19, 115, 100, 88, 104]. In table 2.1, we list the types of node operations that will be the discussed, where ∨ denotes the max operator. The morphological perceptron replaces the sum-product by a max-sum operation [100, 19]. The use of the max function inherently adds 7 Table 2.1: Different node operations studied in this work Node Operation Sum-Product Equivalent sum-product z = X i (xiyi) max-sum z = _ i (xi + yi) signed max-sum z = _ i ai(xi + yi) max∗ -sum z = max∗ i (xi + yi) LNS z = max∗ ± i (log |xi | + log |yi |) sz = sx ⊕ sy non-linearity to the network. The morphological perceptron was extended to use a signed max-sum operation by adding a binary sign parameter in [115]. In the field of digital communications, and specifically in error correction coding literature, the max∗ -sum operation∗ is widely used in decoding iterative digital codes [93]. This operation can be seen as a natural extension to the max-sum node. Finally, it is well known that the exact equivalent of a sum-product can be implemented in the log domain, i.e. using the LNS. LNS requires the use of max∗ +-sum and max∗ −-sum processing along with tracking the signs of the linear operands. A neural network using LNS can thus implement the same processing as a standard Artificial Neural Network (ANN). Therefore, LNS-based networks inherit the approximation capabilities of standard ANNs. Specifically, a single layer network in LNS with a non-linear activation is also a universal function approximator[55, 54]. 
Several previous works successfully trained ANNs using LNS and demonstrated that the performance of these LNS-networks is on a par with standard networks on publicly available datasets [88, 104, 5]. To the best of our knowledge, the approximation capabilities of neural networks with these max-like units have not yet been studied. In this work, our goal is to characterize the approximation capabilities of neural networks having a max-sum, ∗max∗ -sum is also known as the Jacobian Logarithm [93] 8 . . Figure 2.1: A d hidden layer neural network with scalar input x and scalar output fd(x). The node processing at each unit is donated by the function g(.). Note that g(.) also includes the activation function, when applicable. signed max-sum or a max∗ -sum nodes. We prove that these kind of networks are not universal approximators. In addition, we characterize the exact set of functions that they can approximate. Consider a scalar-input and scalar-output fully connected neural network with d hidden layers as shown in Fig. 2.1. Denote the node function at each unit by the g(.). Note that we also lump the activation function, if any, in the definition of g(.). The standard MAC-based node processing g(.) with input activation vector x = [x1, x2, .., xn], weight vector w = [w1, w2, .., wn], bias value b, and a non-linear activation function σ(.), is defined as g(x, w, b) = σ(b + Pn i=1 wixi). A well known fact of neural networks is that they are universal approximators. In [54], it was shown that a one-hidden layer neural network f : R k → R l with a bounded and non-constant activation function can uniformly approximate any function h on R k , given that sufficient hidden units are available. This result has been later extended to networks with unbounded activations, such as the rectified linear activation (Relu) [136]. A class of neural networks N : [0, 1]n → R is said to uniformly approximate a given function h : [0, 1]n → R iff for any ϵ > 0, there exists a function N ∈ N such that ∀x ∈ [0, 1]n : |N(x)−h(x)| < ϵ. 9 This class of neural networks is said to be a universal approximator iff it uniformly approximates all continuous functions h : [0, 1]n → R. The universal approximation theorems mentioned above concern neural networks with the standard sum-product or MAC processing. In this work, the approximation capabilities of neural networks with three different non-standard node operations are investigated. Our proof method is to show neural networks with the non-MAC nodes exhibit bounded first derivatives. By the following lemma, this implies that these networks are not universal approximators. Lemma 1 Given a single-input and single-output continuous neural network f : [0, 1] → R. Suppose that a ≤ f ′ (x) ≤ b almost everywhere for a, b ∈ R. Then f can only uniformly approximate functions h : [0, 1] → R such that a ≤ h ′ (x) ≤ b almost everywhere. While this result appears intuitive, the same does not hold for higher order derivatives. That is, if f (n) (x) ∈ [a, b] for some n ≥ 2, this does not mean that f can only universally approximate functions h such that a ≤ h (n) (x) ≤ b. A simple counter example is a sum-product network with Relu activation on the hidden layers, which is known to be a universal approximator [136] but has a bounded second derivative. In Appendix 1, we show that the second derivative of such a network is zero a.e. 
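For concreteness, the short NumPy sketch below (illustrative only; it is not taken from this dissertation, and all function names are ours) evaluates the node operations of Table 2.1 on the same operands and numerically checks the equivalence discussed above: with positive linear inputs and weights, a log-domain max*-sum reproduces the linear-domain sum-product.

```python
import numpy as np

def sum_product(x, y):
    # Standard MAC node: z = sum_i x_i * y_i
    return np.sum(x * y)

def max_sum(x, y):
    # Morphological (max-sum) operation: z = max_i (x_i + y_i)
    return np.max(x + y)

def signed_max_sum(x, y, a):
    # Signed max-sum: z = max_i a_i * (x_i + y_i), with a_i in {-1, +1}
    return np.max(a * (x + y))

def max_star_sum(x, y):
    # max*-sum (Jacobian logarithm): z = ln(sum_i exp(x_i + y_i))
    return np.log(np.sum(np.exp(x + y)))

# With positive linear operands, the log-domain max*-sum matches the
# linear-domain sum-product: ln(sum_i a_i b_i) = max*_i(ln a_i + ln b_i).
a = np.array([0.5, 2.0, 1.25])
b = np.array([4.0, 0.1, 3.0])
lin = sum_product(a, b)                        # linear-domain dot product
log_dom = max_star_sum(np.log(a), np.log(b))   # log-domain max*-sum
print(np.isclose(np.log(lin), log_dom))        # True

# The max-like nodes on the same (log-domain) operands for comparison.
print(max_sum(np.log(a), np.log(b)))                               # = ln(max_i a_i*b_i)
print(signed_max_sum(np.log(a), np.log(b), np.array([1, -1, 1])))
```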
2.1.2 Approximation Capabilities In this section we consider the approximation capabilities of neural networks with max-sum processing in place of standard sum-product processing. We also consider two generalizations to max-sum processing. We show that these non-MAC networks are not universal approximators from R → R, and thus they are also not universal approximators for R n → R for any n ≥ 1. We state the results of each case in this section and provide the proofs in appendices. Max-Sum Network 10 In the case of a max-sum network, the node function g(.) in Fig. 2.1 is defined as g(x, w, b) = b ∨ (w1 + x1) ∨ · · · ∨ (wn + xn) = b ∨ _n i=1 (wi + xi) ! (2.1) where ∨ is the max operator, i.e. x ∨ y = max(x, y). The max-sum node is also known as the morphological perceptron [19]. Note that the max operation inherently adds non-linearity to the node and no explicit activation function is used. Next we present our first theorem. Theorem 1 Consider a single-input single-output, d hidden layer neural network with max-sum node processing fd(x) : R → R. Then fd(x) = max(w, w′ + x) for some constants w, w′ ∈ R. Thus, max-sum node processing results in a very limited class of functions that can be realized. In particular, by lemma 1, max-sum networks are not universal approximators. In [19], max-sum layers were combined with standard sum-product layers to obtain effective classifiers. Signed Max-Sum Network In the case of a signed max-sum network, the node function g(.) in Fig. 2.1 is defined as g(x, w, b) = b ∨ a1(w1 + x1) ∨ · · · ∨ an(wn + xn) = b ∨ _n i=1 ai(wi + xi) ! (2.2) where ai ∈ {−1, 1}. Note that the ai ’s can either be learnable network parameters or fixed. For our purposes, this choice is irrelevant to the study of the approximation capability of the network since we consider all ai ∈ {−1, 1}. 11 Theorem 2 Consider a single-input single-output, d hidden layer neural network with signed maxsum node processing fd(x) : R → R. Then f ′ d (x) ∈ {−1, 0, 1} a.e. By Lemma 1 signed max-sum networks are not universal approximators. In fact, by the above theorem, these networks have very limited approximation capabilties. Note also that the max-sum node is a special case of the signed max-sum where all the ai ’s are set to 1. Theorem 1 implies that the derivative of max-sum networks is limited to the set {0, 1} which is a proper subset of the possible derivatives of signed-max-sum networks. Max*-Sum Network To define the max*-sum node, first note the definition of the max* function: max∗ (x, y) = ln (e x + e y ) (2.3) which can be nested as follows max∗ i xi = max∗ (x1, x2, .., xn) = ln (e x1 + e x2 + .. + e xn ) = ln Xn i=1 e x i ! (2.4) In the case of a max*-sum network, the node function g(.) in Fig. 2.1 is defined as g(x, w, b) = σ(max∗ (b, max∗ i xi + wi)) = σ ln e B + Xn i=0 e xi+wi !! (2.5) where σ(x) is any activation function such that σ ′ (x) ∈ [0, 1], e.g. a Relu. 12 The max*-sum node processing may be viewed as doing arithmetic in the log domain. Consider the standard sum-product ci = P i aibi where ai , bi > 0 ∀i. If we let Ci = ln(ci), Bi = ln(bi) and Ai = ln(ai) then Ci = ln X i aibi ! = ln X i e Ai+Bi ! = max∗ i (Ai + Bi) (2.6) This indicates that if all data inputs to the network as well as all the weights are non-negative, then working with max*-sum nodes is the equivalent of working with the standard sum-product node, but in the log domain. This is the reason for introducing a nonlinear activation function for max*-sum networks. 
Specifically, if a linear mapping with non-negative weights, biases, inputs, and outputs is implemented in the log domain, it would be a network with max*-sum node processing and no activation function. Theorem 3 Consider a single-input single-output, d hidden layer neural network with max*-sum node processing fd(x) : R → R. Then 0 ≤ f ′ d (x) < 1 Again, by Lemma 1 max*-sum networks are not universal approximators. Discussion In the previous subsections we proved that a neural network with a max-sum, signed max-sum or max*-sum node processing is not a universal approximator. If max*-sum were extended to include signed inputs, weights, and biases, one would need to separately track the effects on the sign and magnitude of the quantities. In fact, this is LNS arithmetic. A neural network with LNS arithmetic, and a non-linear activation function, is equivalent to a log-domain sum-product ANN, and thus is also a universal approximator. This suggests that 13 it is unlikely that there exists an ANN using max -like family of computations that is a universal approximator and substantially less complex than an LNS network. We note that a recent pre-print paper, posted after the acceptance of this work, proved that networks with sum-product processing and non-negative weights are not universal approximators [126]. Specifically, standard ANNs with non-negative weights can generate sections with negative slopes only. This is complementary to our result for max*-sum processing, where max*-sum processing can be seen as the log domain equivalent of sum-product with non-negative values. 2.2 Training with Logarithmic Number System (LNS) 2.2.1 Background The field of deep learning has witnessed remarkable advancements in recent years, leading to the development of increasingly larger and more complex neural network models. However, this progress comes at a steep price – the escalating costs associated with training these neural networks. While considerable research has focused on reducing inference costs, particularly through techniques like quantization [23, 32], the high computational expense of training remains a critical challenge. Training neural networks still predominantly relies on floating-point arithmetic due to its ability to handle the wide dynamic range required during the back-propagation learning pass. In response to the rising training costs, researchers have explored methods for training neural networks using fixed-point arithmetic† [46, 130, 8, 125, 43]. These approaches have shown promise in reducing computational expenses but come with limitations. Integer arithmetic suffers from restricted range (see Table 2.2), precision loss and quantization errors introducing bias in the training process. To mitigate these issues, existing implementations often introduce additional hardware components, †While fixed-point representation is a specific form of integer arithmetic, in the context of this work, ‘integer’ and ‘fixed-point’ are used interchangeably to describe the numerical format. 14 Table 2.2: Comparison of linear fixed-point and log fixed-point at different bitwidths. X-YF bitwidth representation refers to an X bitwidth number with Y dedicated fractional bits. For the log representation, this is inclusive of the extra real sign bit. ULP is the Unit of Least Precision. In this context, it refers to the smallest positive representable value. Note that for linear fixed-point, since resolution is fixed, ULP and resolution are equivalent. 
The calculations for the log format assume a logarithm base 2. Format Quantity Bitwidth 16-4F bits 16-6F bits Linear ULP 0.0625 0.015625 Range [−2048, 2047] [−512, 511] Resolution Fixed Log ULP 2 −1024 2 −256 Range [−2 1024 , 2 1024] [−2 256 , 2 256] Resolution Variable Table 2.3: List of network modifications and extra operations used in fixed-point training literature. Note that our work does not employ any of these techniques which complicate the training pipeline and/or add additional hardware and computation overhead. Technique Reference Not used here? Scaling Techniques (Loss, per-layer, per-tensor) [125] ✓ Rounding schemes (e.g. Stochastic Rounding) [46] ✓ Higher bitwidth Accumulation [43] ✓ Full-precision for quantization-sensitive layers/operations [130] ✓ Integer friendly layers or weight updates [141] ✓ e.g. stochastic rounding [46, 43], or computational overhead, e.g. computing a per-layer scaling parameter [125]. Other studies employ specialized techniques to ensure training stability and/or propose alternative layers or operations that are friendlier to integer arithmetic [8]. Importantly, many of these studies do not perform low-bitwidth end-to-end training, often relying on full precision for sensitive operations and accumulating in higher bitwidths to prevent overflow and preserve numerical accuracy [130, 8, 43] . Amidst these efforts, the LNS has emerged as a compelling alternative [105, 141, 76]. In LNS, the log of the quantities of interest are used. We refer to the original quantities as being in the ‘linear domain’ and the corresponding LNS quantities as being in the ‘log domain’. There are many potential benefits to performing fixed-point arithmetic in the log domain including the elimination 15 of multiplies and support for a wide dynamic range similar to that of floating-point (see Table 2.2). The elimination of multiplies in LNS is important since most computations in training are MAC operations and multiplies dominate the circuit complexity of a MAC operation. While the potential of LNS has been demonstrated to some degree, it presents its own unique challenges, particularly in addressing the addition operation. In LNS, a multiplication in the linear domain translates into a simple addition in the log domain, but conversely, addition in the log domain is more complicated than in the linear domain (see Section 2.2.2). Previous studies have explored the application of the log number system to train neural networks. A notable contribution is the LNS-Madam study [141] in which the design of a new LNS-friendly learning algorithm, dubbed Madam, was proposed. As will be detailed in Section 2.2.1, many previous fixed-point studies do not perform end-to-end fixed-point training using a single level of precision, but rather either rely on full-precision floatingpoint at some point in the network or employ mixed-precision techniques. While this enables the minimization of the bitwidths for various computations, this ‘non-uniformity’ can be undesirable in hardware implementations. For example, many proposed machine learning accelerators employ hardware reuse wherein a single processing element circuit is reused for many computations across various layer types, e.g. [20]. In contrast to existing methods, our work presents a novel approach that serves as a seamless drop-in replacement for floating-point arithmetic in neural network training, which is well-suited for this hardware reuse paradigm. 
Notably, we retain the ability to utilize the exact same network, learning algorithm, and even the hyper-parameters as floating-point training to achieve comparable performance using one precision format across all the network layers. It is also worth noting that, unlike previous fixed-point approaches, we perform intermediate accumulation in the same precision as the representation itself. For example, a 16-bit unit exclusively operates on 16-bit integers, eliminating the need for high-bitwidth accumulators. This feature allows our LNS computational units to be very efficient, as will be shown in Section 2.2.5. 16 Table 2.4: Comparing QAA-LNS with recent LNS-based designs. This Table is adopted from [141]. We extend the table with the last three rows to highlight the difference between our design and previous ones. ‘Quantization-Aware Arithmetic’ refers to optimizing the arithmetic to each unique bitwidth. The ‘Drop-in Replacement’ feature is the ability to use the LNS design in-place of floating-point in all network operations while maintaining the same network and hyper-parameters, and achieving comparable accuracy. QAA-LNS [141] [105] [87] Support training from scratch? ✓ ✓ ✓ ✗ Weight update precision <16-bit <16-bit 32-bit 32-bit Efficient log-to-linear conversion support ✓ ✓ ✓ ✓ Large-scale evaluation ✓ ✓ ✗ ✓ Accumulation precision <16-bit 24-bit 32-bit 32-bit Quantization-aware approximation ✓ ✗ ✗ ✗ Drop-in replacement to floating-point ✓ ✗ ✗ ✗ The critical advance in our approach is the development of a hardware-friendly, piece-wise linear approximation for the complex addition function encountered in LNS. We constrain this approximation to use power-of-two slopes so that efficient bit-shift operations are used instead of costly multiplications. Critically, our approach is distinguished by its quantization-aware nature, i.e., we optimize our approximation for the specific bitwidth being used. We employ a simulated annealing algorithm to optimize the parameters of this piece-wise linear approximation, guided by a quantization-aware loss function. This optimization can be done offline and thus does not contribute to the cost of training. Our approach will be detailed in Section 2.2.3. To summarize, what distinguishes our work from all previously proposed fixed-point training techniques is that it is true end-to-end single-precision integer training with no extra operations introduced (see Table 2.3). The training process is exactly the same as that when using floatingpoint, except for performing the arithmetic in LNS. We call this feature of our method the ‘drop-in replacement’ feature. This also means that any point in training, we can switch from floating-point 17 to LNS or vice-versa by simply converting the weights and the data, and the training trajectory will not change.‡ Our contributions can be summarized as follows: • We introduce Quantization Aware Approximate - Log Number System (QAA-LNS), an approximate quantization aware log fixed-point arithmetic as a drop-in replacement for floating-point in training neural networks. • The proposed approximation is hardware-friendly and optimized per bitwidth representation. • A C++ bit-true simulation successfully trains a ResNet-18 model from scratch on datasets such as CIFAR-100 and TinyImageNet using significantly lower bitwidth without sacrificing accuracy compared to that of 32-bit floating-point. 
• We conduct a hardware study revealing up to 32.5% reduction in area and 53.5% reduction in energy consumption for LNS multiply-accumulate (MAC) units when compared to their linear counterparts. Floating-Point Literature. Floating-point is the dominant number system for training neural networks today. In addition to the single-precision format, or fp32, recent years have shown halfprecision format, or fp16, emerging as a viable alternative, with modern accelerators and deep learning frameworks already supporting training with fp16. The reduction of bitwidth by half allows loading and training larger models, using larger batch sizes and utilizing smaller hardware computational units, thereby reducing energy consumption. However, the reduced dynamic range and precision of fp16 are non-trivial and can lead to degradation in a model’s performance. This motivated the development of new floating-point variants and domain specific architectures (DSA) with deep learning applications in mind, such as bfloat16§ which can run on Google’s tensor processing unit (TPU) [59]. Further quantization below 16-bit word-length is a recent focus of the ‡Provided enough precision is used. §The bfloat16 numerical format: https://cloud.google.com/tpu/docs/bfloat16 18 research community with the challenge of combating very significant quantization errors and precision loss. This causes some sensitive operations in network training to fail, e.g. batch-normalization layer, gradient computations and weight updates. Thus, many researches in this field resort to either implementing these quantization-sensitive operations in full-precision or simply avoid such layers, e.g. by not using a batch-normalization layer as part of the network. One of the early studies on 8-bit floating-point training [8] overcame some of these obstacles by developing an alternative to the batch-normalization layer, dubbed ‘Range Batch-Normalization’, which is more friendly to low-precision arithmetic. In addition, 16-bit copies of the gradients were still used in the backward pass for operations not posing a performance bottleneck. [113] studied pushing the quantization further to 4-bit training by introducing a new 4-bit floating-point format. To handle the small dynamic range, a custom per-layer gradient scaling technique was used to keep the gradients inside the representable range. In addition, rounding in this 4-bit format causes a quantization bias which requires introducing a special rounding scheme to alleviate the issue. Fixed-Point Literature. Fixed-point¶ operations (e.g., MACs) can be implemented in circuitry that is much simpler than that required for floating-point operations, which handles the mantissa and exponent separately. This makes fixed-point particularly advantageous in resource-constrained environments because of its lower energy consumption [56]. Thus, fixed-point methods are widely used in many application specific signal processing circuits (e.g., communication modems). This potential complexity reduction has spurred a lot of interest in low-precision fixed-point arithmetic for deep learning. Quantizing neural network inference has been successfully demonstrated for extremely low bitwidths [23, 32]. However, its limited range and precision compared to that of floating-point makes it much more challenging to utilize in training due to the wide dynamic range requirements in the backward pass. 
[46] is one of the early studies implementing 16-bit fixed-point ¶For a refresher on fixed-point arithmetic, refer to Fixed-Point Arithmetic: An Introduction by Randy Yates https://courses.cs.washington.edu/courses/cse467/08au/labs/l5/fp.pdf 19 weights for training. To mitigate the bias introduced in the rounding of fixed-point numbers, the authors proposed using a stochastic rounding scheme, which has become common in fixed-point literature. In [130], 8-bit integers were used in training but some sensitive layers and operations, such as batch-normalization and soft-max, were left in full precision floating-point. [125] introduce NITI, an integer-only training framework utilizing a per-layer scaling factor for each different quantity in the network: weights, activations, gradients and errors. Thus, the range of representable values for each quantity can be adjusted separately. NITI successfully trains with 8-bit integers but as with most previous works, accumulation is done in higher precision, in this case 32-bit. Another interesting work is that of [43] where the authors manipulate the floating-point format to switch it into 8-bit integers instead of using quantization directly. Still, accumulation is done in higher bitwidth and non-linear operations aside from the ReLu activation remain in floating-point. LNS Literature. LNS traces its origins to foundational works such as [116], [62] and [71]. One of the main drawbacks of fixed-point formats is the limited range due to the resolution being fixed. A fixed step-size in the linear domain becomes a variable step-size in the log domain because of the non-linearity of the log function. Therefore, in LNS, performing the quantization in the log domain produces a variable resolution and an exponentially larger representable range. A representable range of [c, d] in the linear domain becomes [b c , bd ] in the log domain, where b is the logarithm base. While LNS offers significant advantages in terms of range and ULP, it also has potential limitations. In LNS, the non-uniform distribution of quantization levels leads to larger spacing between levels for higher magnitudes. This can result in higher quantization errors for large values compared to fixed-point representations with uniform spacing. However, it is important to note that the impact of these quantization errors on the overall performance of LNS-based systems depends on the specific application and the range of values encountered. To mitigate the potential impact of quantization errors, careful selection of the LNS parameters, such as the number of fractional bits 20 and the logarithm base, is crucial. Additionally, multiplications transform into additions because of the log property: log(ab) = log(a) + log(b). The downside is that the addition operation log(a + b) has no simple form in the log domain. This drives most of the research on LNS to find an efficient approximation for addition [4]. [6] is an early study which noticed significant hardware savings when implementing logarithmic arithmetic in back-propagation. In [105], a simple bit-shift approximation for addition is utilized to train small networks using 16-bit LNS on MNIST-like datasets while accumulating in 32-bit. Our experiments show that a much more precise approximation is needed for large-scale experiments. The closest method to ours is [141] where LNS-Madam is introduced to successfully train using 8-bit LNS, with 24-bit accumulation, on large-scale datasets such as ImageNet [30]. 
To achieve this result, a multiplicative weight update scheme is introduced, dubbed Madam, as an alternative to Adam, where the weight updates are additive. In contrast to these previous LNS proposals, our work does not employ higher bitwidths for accumulation and does not introduce any modification to the network, as shown in Table 2.4.

2.2.2 LNS Definition and Arithmetic

In this section, we formalize the definition of LNS and describe the arithmetic operations within this system. A real number x in the 'linear domain', signifying a quantity in the training process, e.g. a weight or a gradient, can be equivalently represented in the 'log domain' by a tuple as follows:

$$ x \leftrightarrow \big(\log_2(|x|),\ s_x\big) \tag{2.7} $$

where s_x represents the sign of x. Next, we expand this representation to account for quantization, thereby introducing fixed-point LNS.

Definition 1 (LNS Fixed-Point Representation). Let x be a real number and let T, I, and F denote the total number of bits, integer bits, and fractional bits, respectively, used in the log-magnitude representation of x such that T = I + F. The fixed-point LNS representation of x is expressed as a two-tuple (ℓ_x, s_x), where the sign flag s_x = 1 for negative x and s_x = 0 for non-negative x, and the log-magnitude is

$$ \ell_x = \operatorname{clip}\!\Big(\operatorname{round}\!\big(\log_2(|x|) \times 2^{F}\big),\ -2^{T-1},\ 2^{T-1}-1\Big) \tag{2.8} $$

where clip(v, min, max) bounds v within the range specified by min and max, and 2^F serves as the scaling factor, locating the binary point at the desired fixed position. In total, T + 1 bits are needed to represent x in the log domain, including the single bit for the sign s_x. Multiplication in LNS is straightforward, as it translates to addition of the logarithmic components. Addition in LNS, however, is more involved due to the non-linear nature of the logarithmic mapping.

Definition 2 (LNS Multiplication). Given two LNS numbers (ℓ_x, s_x) and (ℓ_y, s_y), the linear product z = x · y is defined in LNS as:

$$ (\ell_z,\ s_z) = (\ell_x + \ell_y,\ s_x \oplus s_y) \tag{2.9} $$

where ⊕ denotes the XOR operation on the sign bits, yielding the sign of the product.

Definition 3 (LNS Addition). Given two LNS numbers (ℓ_x, s_x) and (ℓ_y, s_y), the linear sum z = x + y is defined in LNS as (ℓ_z, s_z), where:

$$ \ell_z = \log_2(|x + y|) = \log_2\!\Big(\big|(-1)^{s_x} 2^{\ell_x} + (-1)^{s_y} 2^{\ell_y}\big|\Big) =
\begin{cases}
\max(\ell_x, \ell_y) + \Delta_{+}(|\ell_x - \ell_y|) & \text{if } s_x = s_y \\
\max(\ell_x, \ell_y) + \Delta_{-}(|\ell_x - \ell_y|) & \text{if } s_x \neq s_y
\end{cases} \tag{2.10} $$

$$ \Delta_{+}(d) = \log_2\!\big(1 + 2^{-d}\big) \quad \text{for } d \geq 0 \tag{2.11} $$

$$ \Delta_{-}(d) = \log_2\!\big(1 - 2^{-d}\big) \quad \text{for } d \geq 0 \tag{2.12} $$

where Δ± are defined for ease of notation. The sign bit for the sum is given by s_z = s_x if ℓ_x ≥ ℓ_y and s_y if ℓ_x < ℓ_y.
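To make Definitions 1-3 concrete before turning to the approximation literature, the following is a minimal Python sketch of LNS quantization, multiplication, and exact addition. It is purely illustrative: the bitwidths are example values, exact zero and the equal-magnitude cancellation case are not handled, and the Δ± terms are evaluated with floating-point log2 rather than the fixed-point approximation developed in the next section.

import math

F = 8                                            # fractional bits of the log-magnitude (assumed)
T = 14                                           # total bits of the log-magnitude (assumed)
LO, HI = -(1 << (T - 1)), (1 << (T - 1)) - 1     # two's-complement clip range

def to_lns(x: float):
    """Quantize x to (scaled log2 magnitude, sign flag) as in (2.8); zero handling omitted."""
    ell = round(math.log2(abs(x)) * (1 << F))
    return max(LO, min(HI, ell)), 0 if x >= 0 else 1

def from_lns(ell: int, s: int) -> float:
    return (-1.0) ** s * 2.0 ** (ell / (1 << F))

def lns_mul(a, b):
    """Multiplication (2.9): integer add of log-magnitudes plus an XOR of the sign flags."""
    (ex, sx), (ey, sy) = a, b
    return max(LO, min(HI, ex + ey)), sx ^ sy

def lns_add(a, b):
    """Addition (2.10): max plus a Delta term, here the exact Delta of (2.11)-(2.12)."""
    (ex, sx), (ey, sy) = a, b
    d = abs(ex - ey) / (1 << F)                  # difference back in log2 units
    delta = math.log2(1 + 2 ** -d) if sx == sy else math.log2(1 - 2 ** -d)
    ell = max(ex, ey) + round(delta * (1 << F))
    s = sx if ex >= ey else sy
    return max(LO, min(HI, ell)), s

# quick check
x, y = 3.5, -1.25
print(from_lns(*lns_mul(to_lns(x), to_lns(y))))  # approximately -4.375
print(from_lns(*lns_add(to_lns(x), to_lns(y))))  # approximately 2.25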
Approximations. Figure 2.2 depicts the Δ+(d) and Δ−(d) curves, along with two approximation methods (more details later). Because it is not possible to implement the functions in (2.11) and (2.12) exactly in fixed-point, i.e. using integer math, researchers target efficient approximations. This is particularly viable for machine learning applications due to the inherent tolerance to noise, as long as the introduced numerical errors are unbiased. For instance, in [105], a simple bit-shift approximation method, as illustrated in Figure 2.2, is applied to relatively small-scale experiments. While cost-effective, our experiments have revealed that more precise approximations are essential for effectively training large-scale models. This is also the case in [4], where Mitchell's approximation [86] for the logarithm is used to implement back-propagation in LNS for small networks. In [141], a hybrid approximation approach, combining Mitchell's approximation with table lookup and subsequent bit-shift operations, was adopted to train on large-scale datasets such as ImageNet. This design, coupled with other enhancements, enables successful 8-bit LNS training, although accumulation still relies on 24-bit precision.

Figure 2.2: Plot of the functions Δ+(x) and Δ−(x) with an example of a bit-shift approximation (in green) and a piece-wise linear approximation with power-of-two slopes (in blue). Note that the displayed piece-wise linear curves are intentionally a relatively poor approximation of Δ±(x) to illustrate the nature of the approximation errors.

2.2.3 QAA-LNS

In this section, we introduce the proposed Quantization Aware Approximate - Log Number System (QAA-LNS) approach. This method is distinguished from previous approximation techniques by accounting for quantization in the Δ± approximations.

Efficiency. Our design uses a piece-wise linear approximation featuring power-of-two slopes, which enables computationally expensive multiplications to be replaced by more efficient bit-shift operations. This piece-wise linear curve, exemplified in Figure 2.2, is defined by the selection of the bin locations, slopes, and offset values for each segment of the curve. The resulting approximation can be expressed as:

$$ \Delta(d) \approx \text{Bit-shift}(d, a_i) + o_i \tag{2.13} $$

where i is the index of the bin c_i into which the input d falls, and a_i and o_i are the slope and offset associated with segment i, respectively. This approximation is used for both the Δ+ and Δ− functions, with different bin boundaries, slopes, and offsets for each.
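As a rough illustration of (2.13), the sketch below evaluates such a piece-wise linear approximation of Δ+ using only shifts and adds. The bin boundaries, shift amounts, and offsets here are hand-picked placeholders for this example; they are not the optimized parameters produced by the procedure described next.

F = 8  # fractional bits (assumed)

# Each segment i is (bin upper boundary, shift k encoding slope -2**-k, offset),
# all in scaled-integer (Q-format) units; placeholder values, four segments only.
SEGMENTS_PLUS = [
    (1 << F,  1, (1 << F)),                # d in [0, 1): slope -1/2
    (3 << F,  2, int(0.835 * (1 << F))),   # d in [1, 3): slope -1/4
    (6 << F,  4, int(0.358 * (1 << F))),   # d in [3, 6): slope -1/16
    (12 << F, 7, int(0.069 * (1 << F))),   # d in [6, 12): slope -1/128
]

def delta_plus_approx(d_scaled: int) -> int:
    """Approximate Delta_+(d) with shifts and adds only (no multiplications)."""
    for upper, shift, offset in SEGMENTS_PLUS:
        if d_scaled < upper:
            # power-of-two slope: the multiply a_i * d collapses to a right shift
            return offset - (d_scaled >> shift)
    return 0  # Delta_+(d) is effectively zero once d is large enough

# Example: compare with the exact value log2(1 + 2**-2), approximately 0.32
print(delta_plus_approx(2 << F) / (1 << F))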
Accuracy. To obtain an accurate approximation, the bin locations, slopes, and offsets must be optimized. Towards this end, consider two extremes. On one end, the straightforward method could be to simply minimize the mean squared error (MSE) between the true curve Δ± and the approximate curve, call it Δ̂±, at some N sample points d_n, n = 1, ..., N, as follows:

$$ L = \frac{1}{N} \sum_{n=1}^{N} \Big( \Delta_{\pm}(d_n) - \hat{\Delta}_{\pm}(d_n) \Big)^2 \tag{2.14} $$

While this is straightforward to implement, it does not account for quantization. Our experiments in Section 2.2.6 show how training diverges when applying this 'non quantization-aware' approach. On the other extreme, one could conceptually consider optimizing the approximation parameters at a given fixed-point format via end-to-end learning performance. While this accounts for quantization, it is likely too computationally complex and is optimized only for a particular dataset and model. In QAA-LNS, we consider a middle-ground approach. The selection of bin locations, slopes, and offsets is optimized for a general LNS addition operation at a particular bitwidth. Consider two real vectors x and y in the linear domain, e.g. weights or activations. Denote the exact sum as z = x + y. Now define z̃ to be the quantized value of z in the linear domain, i.e. quantize z to LNS as in (2.8) and then convert back to the linear domain with infinite precision. Thus, z̃ represents the ideal values we would like to obtain when performing LNS addition of x and y. Next, denote by ẑ the approximate output obtained from LNS addition in (2.10). Define the loss function as:

$$ L(\mathbf{x}, \mathbf{y}) = \frac{1}{N} \sum_{n=1}^{N} (\tilde{z}_n - \hat{z}_n)^2 \tag{2.15} $$

where z̃ and ẑ are implicit functions of (x, y), the bitwidths, and the delta approximation parameters. Finally, we employ the simulated annealing algorithm [63] with a cosine cooling schedule to perform the optimization over the approximation parameters. To generate the neighbor solution at time t in the algorithm, a random bin c_i(t) is selected, and the new bin location c_i(t + 1) is then sampled from a uniform distribution on the interval [c_{i−1}(t), c_{i+1}(t)]. The algorithm then iterates over a finite list of power-of-two slopes and finds the combination of slope a_i(t + 1) and offset o_i(t + 1) which minimizes the mean squared error in segment i.

Discussion. By incorporating quantization into our loss function, the resulting approximation is tailored to the intricacies of each bitwidth representation. As we demonstrate in the subsequent experiments in Section 2.2.6, this aspect proves to be crucial for achieving successful convergence during training. For instance, a visual inspection of the approximation curves (comprising bin locations, slopes, and offsets) for 11-bit LNS and 14-bit LNS may not reveal any visible difference. However, these seemingly subtle distinctions arising from simulated annealing are essential in avoiding the accumulation of numerical errors throughout the training process, preventing divergence.

Selection of Parameters. For our experimental setup, we employ the following parameters: N = 10,000, x and y sampled from a N(0, 3) distribution, and a 16-segment piece-wise linear curve over the range d ∈ [0, 12]. The normal distribution is a natural choice for the sample vectors x and y since it resembles the distributions of weights and activations in the network. Nevertheless, multiple distributions were tested and all were found similarly effective. Similarly, the number of segments/bins was varied: specifically 8, 12, 16, and 32 were tested, and a size of 16 was found to provide the best balance between complexity and accuracy. Regarding the range, an inspection of Figure 2.2 reveals how both curves tend to zero for large d. Around d = 12, the values of Δ+ and Δ− fall below the resolution of most low-bitwidth formats. Therefore, it is enough to optimize the approximation over the range [0, 12].
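The following is a minimal sketch of such a simulated-annealing search with a cosine cooling schedule. For brevity it scores candidate bin layouts with the plain MSE of (2.14) against Δ+; in QAA-LNS the quantization-aware loss (2.15), evaluated by simulating LNS addition at the target bitwidth, would take its place. Step counts, temperatures, and the slope list are illustrative.

import math
import random

POW2_SLOPES = [-(2.0 ** -k) for k in range(0, 8)]   # finite list of power-of-two slopes
D_SAMPLES = [i * 12.0 / 499 for i in range(500)]    # sample points on [0, 12]

def true_delta(d):
    return math.log2(1 + 2.0 ** -d)                 # Delta_+(d)

def segment_fit(lo, hi):
    """Best (error, slope, offset) for one segment, sweeping the power-of-two slope list."""
    pts = [d for d in D_SAMPLES if lo <= d < hi] or [lo]
    best = None
    for a in POW2_SLOPES:
        o = sum(true_delta(d) - a * d for d in pts) / len(pts)   # least-squares offset
        err = sum((true_delta(d) - (a * d + o)) ** 2 for d in pts)
        if best is None or err < best[0]:
            best = (err, a, o)
    return best

def total_loss(bins):
    return sum(segment_fit(bins[i], bins[i + 1])[0] for i in range(len(bins) - 1))

def anneal(n_segments=16, steps=1000, t_max=1e-2, t_min=1e-6):
    bins = [i * 12.0 / n_segments for i in range(n_segments + 1)]
    cur, cur_loss = list(bins), total_loss(bins)
    best, best_loss = list(bins), cur_loss
    for t in range(steps):
        # cosine cooling schedule
        temp = t_min + 0.5 * (t_max - t_min) * (1 + math.cos(math.pi * t / steps))
        cand = list(cur)
        i = random.randrange(1, n_segments)          # perturb a random interior boundary
        cand[i] = random.uniform(cand[i - 1], cand[i + 1])
        cand_loss = total_loss(cand)
        if cand_loss < cur_loss or random.random() < math.exp(-(cand_loss - cur_loss) / temp):
            cur, cur_loss = cand, cand_loss
            if cur_loss < best_loss:
                best, best_loss = list(cur), cur_loss
    slopes_offsets = [segment_fit(best[i], best[i + 1])[1:] for i in range(n_segments)]
    return best, slopes_offsets

bins, params = anneal()
print(bins[:4], params[0])   # first few boundaries and (slope, offset) of the first segment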
2.2.4 Experimental Results

Datasets. We evaluate our method on two popular image datasets: CIFAR-100 [66] and TinyImageNet [70]. CIFAR-100 consists of 50,000 training images and 10,000 validation images, spanning a total of 100 classes. TinyImageNet is a subset of the ImageNet dataset [30] with 100,000 training samples and 10,000 validation samples from 200 classes of images down-sampled to 64 × 64 pixels.

Table 2.5: Median of validation accuracy for the last 10 training epochs of 32-bit floating-point (baseline) vs fixed-point LNS using the VGG models. The LNS bitwidth representation, T-bit (F, o), emphasizes that LNS arithmetic units such as addition and multiplication act on T-bit integers with F dedicated fractional bits; o is the number of overhead flags used in the representation.

Format | Bitwidth | VGG-11 on CIFAR-100 Acc | Degradation | VGG-16 on TinyImageNet Acc | Degradation
Floating-Point | 32-bit | 66.03% ± 0.14% | - | 54.47% ± 0.31% | -
QAA-LNS | 14-bit (F=8, o=2) | 66.13% ± 0.24% | -0.1% | 53.65% ± 0.88% | 0.82%
QAA-LNS | 12-bit (F=6, o=2) | 65.89% ± 0.45% | 0.14% | 53.60% ± 0.76% | 0.87%
QAA-LNS | 11-bit (F=5, o=2) | 59.29% ± 4.77% | 6.74% | 37.51% ± 3.06% | 16.96%

Table 2.6: Median of validation accuracy for the last 10 training epochs of 32-bit floating-point (baseline) vs fixed-point LNS using the ResNet-18 models.

Format | Bitwidth | ResNet-18 on CIFAR-100 Mean ± Std | Degradation | ResNet-18 on TinyImageNet Mean ± Std | Degradation
Floating-Point | 32-bit | 71.83% ± 0.22% | - | 44.53% ± 0.25% | -
QAA-LNS | 14-bit (F=8, o=2) | 71.49% ± 0.15% | 0.34% | 43.92% ± 0.30% | 0.61%
QAA-LNS | 12-bit (F=6, o=2) | 71.63% ± 0.30% | 0.20% | 38.79% ± 1.30% | 5.74%
QAA-LNS | 11-bit (F=5, o=2) | 65.00% ± 3.48% | 6.83% | 35.39% ± 0.16% | 9.15%

Model and Setup. We use two popular CNN architectures, VGG [110] and ResNet [49], for our experiments. For TinyImageNet, we train ResNet-18 and VGG-16 models, while for CIFAR-100, we train VGG-11 and the CIFAR version of ResNet-18 (with a 3 × 3 convolution replacing the initial 7 × 7 convolution and the MaxPooling layer removed to better suit the CIFAR-100 image scale). All layers and operations are implemented in fixed-point LNS. We use RandomFlip and RandomCrop augmentations with 4-pixel padding, at 32 × 32 resolution on CIFAR-100 and 56 × 56 resolution on TinyImageNet. A hyper-parameter search was performed using floating-point and the same settings were then used for LNS: SGD with weight decay of 0.0001, momentum of 0.9, batch size of 128, and a cosine scheduler with warm restarts [78], starting at learning rates of 0.1 for CIFAR-100 and 0.5 for TinyImageNet. Each model was trained for 100 epochs only, since extended training in LNS is computationally intensive.

LNS setting. In our training runs, we incorporate an additional 'zero-flag' overhead bit in our LNS representation. This bit serves the purpose of precisely representing exact zero values since the logarithm of zero is undefined. Another approach is to represent zero with the smallest representable value allowed in the corresponding bitwidth. This is studied and shown to be viable in Section 2.2.6, i.e. it is possible to drop the zero-flag bit without sacrificing accuracy. In addition, the soft-max function in the final layer requires computing an exponential. Since this is only used once in the final layer, not posing a performance bottleneck, we utilize a relatively complex piece-wise linear approximation, similar to that introduced in Section 2.2.3, consisting of 256 segments. Finally, three LNS bitwidth configurations were chosen for our experiments, each with o = 2 flag bits, one for the sign flag and another for the zero-flag: T = 14-bit (with F = 8 fractional bits + o = 2 flag bits), T = 12-bit (with F = 6 fractional bits + o = 2 flag bits), and T = 11-bit (with F = 5 fractional bits + o = 2 flag bits). We refer to T as the arithmetic bitwidth since, for example, the addition (ℓx + ℓy) in (2.9) is performed by a standard two's-complement T-bit adder. In other words, the sign bit and zero-flag bit have negligible effect on the LNS MAC circuit complexity, and we therefore refer to these schemes as 'T-bit QAA-LNS.' Our objective is to present cases where training reaches floating-point levels as well as those leading to significant degradation in accuracy, i.e. failure cases.

Evaluation. In each experimental configuration, we conduct two training runs utilizing different random seeds and report the mean and standard deviation of the validation accuracy. Our chosen metric for evaluation is the median of the validation accuracy over the last 10 epochs. Each of the experiments was conducted on either an NVIDIA RTX 2080 Ti GPU or an NVIDIA RTX 3090 GPU. No deep learning framework was used; to run LNS on a GPU, custom CUDA kernels were written.
Results. The results from our experiments with the VGG and ResNet models are presented in Tables 2.5 and 2.6, respectively. We also depict validation accuracy curves from our experiments in Figure 2.3. Using VGG models, training on CIFAR-100 and TinyImageNet with 12-bit arithmetic shows negligible accuracy degradation compared to the 32-bit floating-point baseline. Accuracy declines when using 11-bit arithmetic. For ResNet, similar results are seen on CIFAR-100, while TinyImageNet required 14-bit arithmetic to maintain accuracy. On both datasets, training with lower bitwidths led to divergence, highlighting that larger models and datasets need higher precision for effective training.

Table 2.7: Standard LNS vs QAA-LNS. Median of validation accuracy for the last 10 training epochs of 32-bit floating-point (baseline) vs fixed-point QAA-LNS and standard LNS using the ResNet-18 model on TinyImageNet.

Format | Bitwidth | ResNet-18 on TinyImageNet Acc | Degradation
Floating-Point | 32-bit | 44.53% | -
QAA-LNS | 14-bit (F=8, o=2) | 43.92% | 0.61%
Non QAA-LNS | 14-bit (F=8, o=2) | 30.09% | 14.44%
Non QAA-LNS | 18-bit (F=12, o=2) | 40.74% | 3.79%
Non QAA-LNS | 20-bit (F=14, o=2) | 40.33% | 4.20%

Standard vs QAA-LNS. Finally, to prove the value of incorporating bitwidth into the design, we compare the standard LNS design against QAA-LNS. We train a ResNet-18 model on TinyImageNet for 100 epochs with a non-bitwidth-specific design. The results are displayed in Table 2.7. These additional experiments show that 14-bit QAA-LNS achieves better accuracy than even a 20-bit non-bitwidth-specific LNS. This demonstrates at least a 6-bit gain solely attributable to our design!

Figure 2.3: Plot of the validation accuracy of floating-point and LNS on TinyImageNet for multiple experiments. To reach the same level as floating-point, at least 14 'arithmetic' bits are needed; 12-bit LNS suffers from severe degradation in accuracy. 'Not QA' refers to using an approximation that is not 'quantization-aware'. '11-bit QA' refers to using a 'quantization-aware' approximation that was optimized for 11-bit LNS.

Software Methods and Comparisons. The goal of this work is to demonstrate the potential of LNS as a 'drop-in' replacement for floating-point in neural network training without introducing any additional modifications to the network. Therefore, the entire LNS code setup had to be written from scratch in C++ and CUDA; no deep learning framework could be used. This also ensures that accumulation is implemented in the same bitwidth representation, i.e. no bitwidth expansion is performed. In addition, while LNS is proposed as a cheaper alternative to floating-point, this requires specialized hardware to support the proposed LNS arithmetic in (2.10), (2.13), and (2.9). Even though LNS CUDA kernels were written, the LNS operations cannot effectively utilize the GPU CUDA cores since these cores are optimized to perform 'linear' multiply-accumulate. For example, a typical LNS training epoch on CIFAR-100 takes over an hour to complete. Consequently, conducting our experiments for more extensive datasets and models, such as ImageNet, or for prolonged epochs, becomes computationally prohibitive.
Considering that much of the literature on low-bitwidth training either trains on ImageNet or conducts experiments for a very large number of epochs, a direct comparison with the literature is difficult. Nevertheless, 32-bit floating-point provides an upper bound on accuracy, and the goal of this work is not to eclipse the state of the art in minimizing the training bitwidth. In fact, prior studies have already demonstrated successful training with 8-bit representations [125, 8, 141, 43] utilizing the modifications in Table 2.3. However, this work can serve as a starting point for research in minimizing the bitwidth of LNS network training, perhaps employing modifications similar to those in Table 2.3. By way of comparison, [46] is an early study on linear fixed-point training that introduced minimal network modifications. That study found that fixed-point training in the linear domain required 20 bits, with 32-bit accumulation, for stable training on large-scale datasets. Thus, our work demonstrates that QAA-LNS is significantly more efficient in terms of bitwidth than standard linear-domain fixed-point training while also maintaining a single fixed-point format, allowing for potential hardware reuse. A fruitful direction for future research is pushing the limits of bitwidth reduction in LNS using quantization-aware approximations and modifications similar to those in Table 2.3.

2.2.5 Hardware Study

We implement QAA-LNS-based MACs with different bitwidth configurations at the RTL level. Linear fixed-point (integer) MACs as well as floating-point MACs are taken as baseline designs. Given the large dynamic range of LNS and floating-point formats, overflow is uncommon in training. Consequently, we set the outputs of the LNS MACs and the floating-point MACs to the same bitwidth as the inputs. For the INT MACs we followed the standard practice of using a multiplier output bitwidth of twice the input bitwidth to avoid overflow. Table 2.8 summarizes the area and power consumption of the proposed QAA-LNS MACs and baseline designs as a function of input bitwidth, synthesized using an open-source, free 45nm cell library at 1.0V and 100MHz.

Table 2.8: Area and power consumption for a MAC based on different number systems, reported as Area (µm²) / Power (µW). The bitwidth listed refers to the total number of bits used by any number representation. For QAA-LNS, this is T + o, where T is the number of arithmetic bits (to represent the log magnitude) and o = 2 is the number of overhead flag bits for the sign and zero-flag.

Bitwidth | QAA-LNS MAC | Fixed-Point INT MAC | Floating-Point MAC
14-bit | 2051.7 / 135.6 (T = 12) | 2736.4 / 237.7 | -
16-bit | 2599.9 / 173.3 (T = 14) | 3494.8 / 318.2 | 5259.9 / 400.6
18-bit | 3215.1 / 220.0 (T = 16) | 4341.9 / 413.8 | -
20-bit | 3745.9 / 263.7 (T = 18) | 5280.5 / 522.1 | -
22-bit | 4258.4 / 301.6 (T = 20) | 6309.2 / 648.7 | -

The LNS MACs demonstrate substantial benefits, saving up to 32.5% in area and 53.5% in power over the INT MACs for the same input bitwidth. This advantage diminishes with decreasing bitwidth. Nevertheless, even in a 16-bit configuration, the LNS MAC exhibits noteworthy benefits, saving up to 25.6% in area and 45.5% in power compared to the INT counterpart, and saving 50.5% in area and 56.7% in power compared to the FP16 counterpart. Based on our training experiments and the results from [46], it is reasonable to compare QAA-LNS at 16-bit (T = 14) to the linear MAC at 20-bit, which is a reduction in power and area of 66.8% and 50.7%, respectively.
These results suggest a compelling case for significant overall area and energy benefits for QAA-LNS in neural network training. Lastly, we note that almost half of the logic in the QAA-LNS MAC is required to determine the bin to which the input belongs during the computation of the piece-wise linear approximation for Δ±. Further optimization of these boundaries to be more hardware-friendly may yield further benefits.

Table 2.9: Ablation studies on CIFAR-100 and TinyImageNet.

Study | Dataset | Accuracy
FP 32-bit | TinyImageNet | 44.53%
LNS 14-bit (Not QA) | TinyImageNet | 30.40%
LNS 14-bit (11-bit QA) | TinyImageNet | 29.09%
FP 32-bit | CIFAR-100 | 71.83%
LNS 14-bit (No zero-flag) | CIFAR-100 | 72.01%

2.2.6 Ablation Studies

A total of three ablation experiments were conducted. To demonstrate the effectiveness of the proposed 'quantization-aware' approximation, the following two cases were considered on TinyImageNet: first, run T = 14-bit LNS with an approximation that does not incorporate quantization, i.e. approximating the Δ± curves directly as in (2.14); second, use the approximation optimized for T = 11-bit LNS for training with the T = 14-bit representation. Results are displayed in Table 2.9 and the learning curves are depicted in Figure 2.3. A severe degradation compared to the floating-point baseline can be seen in both cases. In the case of using the non quantization-aware approximation, the gradients of the network blow up in magnitude and cause numerical instability. These two results not only show that incorporating quantization effects is crucial to ensure numerical stability but also emphasize that the approximation must be tailored to each specific bitwidth separately. Finally, we conduct an experiment on CIFAR-100 using T = 14-bit LNS without adding a zero-flag bit, i.e. with F = 8 fractional bits + o = 1 flag bit. The result in Table 2.9 shows that the zero-flag bit had no benefit in this particular setting and can be dropped (this is yet to be explored for lower bitwidths).

Chapter 3: Techniques for Training in Low-Resource Data Settings in NLP

3.1 Two Applications

In the data-bound regime, compute is no longer a major concern. The main bottleneck to obtaining better models is acquiring high-quality labeled data. As discussed in the introduction, this thesis focuses on developing techniques to tackle these data scarcity issues on specific downstream NLP tasks. Two main tasks are the focus of this work: information extraction and LLM tool-use. Before diving into our contributions to these problems, the next two subsections provide a brief definition and background of each of the two tasks.

3.1.1 Information Extraction

Natural Language Processing (NLP) has increasingly become a pivotal area in the field of artificial intelligence, especially with the introduction of transformer-based language models (LLMs) such as BERT and GPT. These models have pushed the boundaries of what machines can achieve with human language and have been shown to achieve superior performance on an array of NLP tasks compared to previous models. NLP is characterized by its diversity of tasks, each with distinct input-output structures and objectives. Examples include text classification, which simply classifies text into predefined labels; machine translation, which converts text from one language to another; text summarization, which summarizes lengthy texts into concise statements; and information extraction, which aims at extracting useful information from unstructured text.
Within this broad spectrum, Named Entity Recognition (NER) and RE are subtasks under the umbrella of information extraction. NER involves identifying textual spans that correspond to entities such as names, locations, or organizations and classifying them into certain categories. RE builds upon the entities identified by NER to determine and categorize the relationship between pairs of entities. These tasks can be approached as "open" or "closed". Open NER and RE are exploratory, aiming to identify any potential types of entities or relationships directly from the text. In contrast, the goal of closed NER and RE, which this thesis focuses on, is to extract information from a certain predefined set of categories or labels.

Formally, NER can be defined as a sequence labeling problem where each token w_i in a sequence S = w_1, ..., w_n is assigned a label l_i from a set of predefined labels L. Note that usually we are interested in labeling spans of text and not individual tokens, but for simplicity we consider individual tokens here. Each label represents an entity type, and the task involves classifying each token according to these types. The problem can be mathematically represented as:

$$ \hat{l} = \arg\max_{l \in L} P(l \mid w_i, S) \tag{3.1} $$

where l̂ is the predicted label for the token w_i inside S, and P(l | w_i, S) represents the probability of entity label l given the token w_i and the input sequence S. Models typically use algorithms like neural networks to model these probabilities.

Relation Extraction. RE, on the other hand, focuses on identifying semantic relationships between pairs of entities within a text. Once entities are recognized by an NER solution, RE is the task of finding the types of relationships, if any, that exist between pairs of identified entities. This task can be framed as a classification problem where each pair of entities (e_i, e_j) extracted from the text is given a relationship label r from a predefined set of relationships R, or is given the "no relation" label. This process can be represented mathematically as:

$$ \hat{r} = \arg\max_{r \in R} P(r \mid e_i, e_j, S) \tag{3.2} $$

where r̂ is the predicted relationship type for the entity pair (e_i, e_j), and P(r | e_i, e_j, S) is the probability of relationship r given the entities e_i and e_j and the sentence S.

NER and RE are inherently interconnected tasks that can be solved either jointly or separately. For example, a single model can be used to, given a text input, find all entities and then discover all relations between pairs of the extracted entities. Another way to approach this is to train a model to solve each task separately. Since the rise of deep learning and foundational language models, the common approach to solving these tasks is to fine-tune a pretrained model. In this framework, we do not train a model from scratch. Instead, we use an existing pretrained model, such as BERT, and fine-tune it on our task. This usually involves adding some additional layers, such as classification heads, and requires adjusting the training process and objective in a manner suitable to the task. We can refer to these kinds of models as NER and RE "specialized" models, meaning that the structure and training process of these models were specifically developed to solve these tasks. More recently, with the development of LLMs such as GPT 3.5, researchers have been exploring using general-purpose LLMs directly to solve the task, without developing new training algorithms or making modifications to the existing model architectures.
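As a concrete illustration of the "specialized" fine-tuning route, the sketch below frames the NER objective (3.1) as token classification on top of a pretrained encoder. It assumes the Hugging Face transformers library; the label set is a toy example, and the classification head is untrained here, so its predictions are random until fine-tuned. The RE objective (3.2) can be handled analogously by a head that scores a pair of entity-span representations against the relation label set.

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical closed label set (BIO scheme); a real dataset defines its own.
ENTITY_LABELS = ["O", "B-Company", "I-Company", "B-Date", "I-Date"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ENTITY_LABELS)
)

sentence = "Acme Corp acquired Widget Inc in 2020."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # shape: (1, num_tokens, num_labels)

# Equation (3.1): the predicted label of each token is the argmax over the label set.
pred_ids = logits.argmax(dim=-1)[0]
for tok, idx in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), pred_ids):
    print(tok, ENTITY_LABELS[int(idx)])

# Equation (3.2): a relation head would score a pair of entity-span representations
# against a relation label set (including "no relation") with the same argmax rule.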
The advanced capabilities of such general-purpose generative models allow us to simply describe the intended objective in the model's prompt and either train the model to extract the desired outputs or even simply use it in a zero-shot or few-shot setting.

3.1.2 LLM Tool-Use (Function-Calling)

Large language models (LLMs) have advanced significantly, making them a viable option for conversational AI. To extend the capabilities of LLMs beyond simple conversation, integrating external tools, commonly known as function-calling, is essential. These tools can greatly enhance the functionality of LLMs, enabling them to perform complex tasks such as booking reservations or retrieving real-time information, thereby broadening the scope of LLM applications. However, LLMs are known to be imperfect, often exhibiting deficiencies like reasoning errors and hallucinations. These issues become even more pronounced when LLMs interact with external tools. Although function-calling has gained increasing attention in the development of foundational LLMs [38, 1, 91, 3], many models are not specifically pretrained for this task. As our results will show, even LLMs equipped with function-calling capabilities often struggle with accurately predicting the correct tool to use, mapping user intent to tool arguments, and effectively using tools in general. To address the reliability issues of LLMs, two primary research directions have emerged: intrinsic self-correction [82, 109] and correction using external feedback [83, 45]. While approaches like Reflexion [109] have shown promising results in enabling LLMs to correct their own reasoning errors, subsequent studies have cast doubt on these findings. For example, [58] attributed much of the improvement in earlier self-correction work to poor baselines, unfair comparisons, and suboptimal prompt design. Another study [123] further confirmed these concerns, demonstrating that LLMs struggle with self-correction because they are not adept at identifying their own errors, an essential first step in self-correction. The same study also revealed that when errors are explicitly pointed out, LLMs are more capable of self-correcting. While these previous studies have focused on general reasoning errors, our work zeroes in on the function-calling ability of LLMs in multi-turn task-oriented dialogue (TOD) applications. Utilizing LLMs in multi-turn TOD datasets, such as MultiWOZ [15], SGD [98], and ToolTalk [41], presents unique challenges. At each turn, the LLM must determine whether a tool call is necessary, which requires tracking user intent over multiple turns, identifying the correct moments for tool invocation, selecting the appropriate tool from a potentially large set, and accurately using the tool's specifications. This setup is prone to a wide range of errors and failure modes, more so than in general conversational dialogue. In this work, we identify and categorize eight distinct error types, such as premature tool invocation, where an LLM calls a tool before gathering all necessary information from the user, and observation-reasoning errors, where the LLM correctly calls a tool and receives an output but misinterprets this output, leading to an incorrect response.

3.2 FIRE and Curriculum Learning

3.2.1 Introduction and Related Work

The proliferation of textual data in the financial domain presents a unique opportunity for the application of machine learning and Natural Language Processing (NLP) techniques.
The extraction of named entities and their relations from unstructured financial texts, such as Security and Exchange Commission (SEC) filings [124] and financial news articles [11], is a crucial task with significant implications for financial analysis and decision-making. NER [127] and RE [31] is a complex yet crucial task in NLP, particularly within the financial domain. The task demands extensive linguistic and domain knowledge, making dataset creation costly and labor-intensive. This complexity has led to instances where previously hand-labeled and published RE datasets have undergone subsequent corrections post-publication. Examples of such non-financial datasets include TACRED [140] and its revised counterpart, TACRED Revisited [2], as well as DocRED [135] and its updated version, Re-DocRED [118]. 38 The lack of a comprehensive, well-annotated dataset in the financial domain hampers the development and evaluation of algorithms for these tasks. In response to this identified gap, we present FIRE, a dataset specifically constructed for joint NER and RE within the financial domain. Drawn from both financial documents, mainly SEC filings, and financial news articles, FIRE provides a diverse range of linguistic constructs and financial terminologies. The dataset is constituted of 3,025 instances, all hand-labeled according to comprehensive annotation guidelines. Note that an instance (or an example) refers a labeled object, consisting of a single sentence or multiple sentences with associated entity and relation information. Figure 3.1a presents a labeled sentence from the dataset while figure 3.1b is one example of how the labeled data can be used to create a knowledge graph. More examples can be found in the annotation guidelines document which is provided with the dataset. The dataset incorporates 13 named entity categories and 18 relation types, effectively capturing vital details about businesses, including aspects such as their organizational structure, income streams, business strategies, and market maneuvers, including acquisitions. The FIRE dataset also serves as a substantial resource for training, evaluating, and comparing the performance of models specialized in the finance sector. Projects like 10-KGPT [111] and BloombergGPT [129], which are tailored for financial tasks, lack evaluation on dedicated financial RE datasets. FIRE fills this gap, offering a robust platform for testing these models against a diverse and complex set of financial terms and relationships. Our goal is to advance financial NLP by providing a high-quality, manually annotated dataset for refining state-of-the-art LLMs. An additional feature of FIRE is the inclusion of a labeling time data field for each record in the dataset. This feature may provide researchers with additional granularity when analyzing performance. Labeling time can serve as an implicit indicator of example difficulty, offering potential applications for the implementation of curriculum learning strategies [9]. By leveraging this feature, researchers can explore and develop methods that dynamically adjust the learning process based 39 (a) A sentence and its labels from the FIRE dataset. Entity terms are surrounded by a red box, with the entity type abbreviation annotated below the box. An edge between a pair of entities indicates a relation. (DA), (CO), (AC), (LO) and (SE) stand for Date, Company, Action, Location and Sector, respectively. (b) An example of constructing a Knowledge Graph (KG) using the labels from the sentence. 
All sentences in a dataset can be combined to create a KG that summarizes all the collected information.

Figure 3.1: A labeled sentence from the FIRE dataset and an example of how a Knowledge Graph can be built using the collected labels.

on the difficulty of the examples, potentially leading to more efficient learning and improved model performance. In our experiment results section, we provide an initial result of incorporating the labeling time feature into the training process. To the best of our knowledge, this has not been studied yet in the literature. Our contributions are summarized as follows:

• We introduce FIRE, a novel dataset for joint NER and RE within the financial context. FIRE is accompanied by comprehensive annotation guidelines and is hand-annotated by a single annotator to minimize labeling noise.
• We provide an open-source web-based labeling tool, designed to facilitate efficient and precise annotation for NER and RE tasks.
• We demonstrate that utilizing the labeling time of each example can enhance model performance through curriculum learning strategies.

Table 3.1: Comparison of the FinRED, KPI-EDGAR, and FIRE datasets. FIRE has the advantage over FinRED in that it is hand-annotated, and over KPI-EDGAR in that it is larger, has diverse relations, and is more comprehensive in terms of covering the financial aspects of a business. Note that FinRED statistics for entity and relation mentions were not readily available; the figures included below were manually computed after a review of the FinRED data files.

| FinRED | KPI-EDGAR | FIRE (This Work)
Hand-Labeled | ✗ | ✓ | ✓
No. of Instances | 7,775 | 1,355 | 3,025
No. of Entity Types | N/A | 12 | 13
No. of Entity Mentions | 16,780 | 4,522 | 15,334
No. of Relation Types | 29 | 1 | 18
No. of Relation Mentions | 11,121 | 3,841 | 8,366

Sentence vs. Document-Level RE. Sentence-level RE identifies relationships between entities in a single sentence, while document-level RE captures relationships across multiple sentences or entire documents. Document-level RE offers a broader understanding of entity relationships, but sentence-level RE can pinpoint specific relationships more quickly. Document-level datasets include BC5CDR [73], DWIE [138], DocRED [135], and Re-DocRED [118]. Some popular sentence-level RE datasets include TACRED [140], FB-NYT [53], and WebNLG [42]. While many of these are general-purpose, there are domain-specific datasets too [79, 95]. FIRE, despite having some multi-sentence instances, is mainly a sentence-level RE dataset.

Relation Extraction Datasets and Distant Supervision. Creating RE datasets is costly due to labeling. One common technique to deal with this problem is distant supervision, which relies on a knowledge base to automatically label text data [85]. In particular, sentences that mention two entities connected by a relation in the knowledge base are assumed to be expressing that same relation. This strong assumption leads to a large number of noisy samples. To address this issue, researchers have developed methods that relax the distant supervision assumptions [99, 9].

Figure 3.2: Scatter plot of labeling time (in seconds) versus the number of relations in the sentence. The marginal distributions and histograms are displayed at the edges of the plot. For sentences with the same number of relations, there is a wide distribution of labeling times, showing how the two quantities are correlated but still provide different information.

Despite
its limitations, distant supervision remains a popular and effective method for generating large-scale datasets for relation extraction tasks. Several relation extraction datasets have been developed using distant supervision, including FB-NYT [53], a dataset constructed by aligning Freebase [13] relations with The New York Times articles, and WebNLG [42], a text generation dataset created from DBPedia [10], among others. Such datasets have been widely used for training and evaluating relation extraction models. Conversely, FIRE is a supervised dataset in which every instance has been annotated manually following extensive annotation guidelines. While this approach elevates the cost of labeling and poses scalability challenges, it guarantees a high level of precision in the labels. Financial Relation Extraction. Several NER and/or RE datasets in the financial domain have been previously proposed. FiNER-ORD [107] is an NER dataset automatically collected by applying pattern-matching heuristics on financial news articles. Unlike FIRE, this is an NER-only dataset with only three entity types. Another related work is [128], which established a Chinese corpus for relation extraction from financial news. However, this work focuses on relation extraction in the 42 Figure 3.3: Stages of data collection: 1) Manually gather relevant sentences. 2) Hand-label them to create a “seed” dataset. 3) Train an RE-specialized model on this dataset. 4) Use the model on new financial content to identify entities and relations. 5) From the model’s output, select sentences with low-confidence predictions to reduce confirmation bias. Remove existing labels from these sentences, manually annotate them, and merge with prior data. Repeat until the desired dataset size is achieved. Chinese language, while our dataset targets relation extraction in the English language. Two datasets that most closely resemble ours are FinRED, an RE dataset introduced in [108], and KPI-EDGAR, a joint NER and RE dataset introduced in [34]. Both are specialized in the financial domain. FinRED contains 7,775 instances covering 29 relation types and was collected from earning call transcripts and financial news articles. However, FinRED was labeled using the distant supervision technique, which can lead to a large number of noisy samples as outlined previously. In contrast, all instances in FIRE were hand-annotated by a human annotator. Similar to FIRE, the KPI-EDGAR dataset is also hand-annotated but the focus of this dataset is on extracting Key Performance Indicators (KPIs) from financial documents and linking them to their numerical values. It supports 12 entity types but only a single relation type, a binary link either exists between two entities or not. In contrast, FIRE supports an extensively diverse set of relations and its entities extend to broader business aspects, not being exclusively centered on KPIs. Table 3.1 compares the statistics of FIRE with both FinRED and KPI-EDGAR. Labeling Time and Curriculum Learning. In FIRE, we’ve included a ‘labeling time’ attribute for each instance. This data, representing the time it took the annotator to label that particular instance from the dataset, was gathered during the annotation stage without additional 43 cost. This could be useful to researchers examining annotation complexities or considering strategies like curriculum learning - a method inspired by progressive human learning, where models are exposed to easier samples first, gradually moving onto complex ones [9]. 
This method has been extensively applied in a variety of machine learning tasks [139, 64, 89]. A difficulty metric is required to apply curriculum learning. For example, a simple static (known a priori) difficulty metric for textual data can be the length of the sentence in tokens. More sophisticated metrics are data-driven and adjust based on model feedback [81, 68]. In this context, we suggest that 'labeling time' may act as a proxy for the difficulty of an example. As illustrated in Figure 3.2, we observe a positive correlation between the labeling time of a sentence and the number of relations it contains. Despite this correlation, the labeling time can vary significantly for a fixed number of relations, indicating that it is not a redundant feature. Qualitatively similar results are observed when comparing labeling time to sentence length or the number of entities in a sentence. In Section 3.2.6, we provide an initial result showing how incorporating the labeling time feature into the training process can improve the performance of trained models.

3.2.2 FIRE Dataset License and Intended Use

License. The dataset and its associated resources are provided under the Creative Commons Attribution 4.0 International License (CC 4.0) [27]. The labeling tool developed in conjunction with the dataset is licensed under the MIT open-source license; see the LICENSE file for details.

Intended Use. The intended use of the FIRE dataset is two-fold. First, to advance research in the area of joint NER and RE, specifically within the financial domain: it is designed to serve as a benchmark for evaluating the performance of existing models, as well as a training resource for the development of new models. Second, the FIRE dataset can serve as a valuable resource for financial analysts and auditors, enabling them to harness automated algorithms for expedient and efficient extraction of critical information from financial documents.

3.2.3 Data Splits and Statistics

In Table 3.1, some basic statistics of the FIRE dataset are displayed. The different entity and relation types, as well as their distribution in the dataset, can be found in Appendix B.1. The dataset was initially partitioned randomly into training, development (validation), and testing sets following a 70%, 15%, 15% split, respectively. Because financial reports, by their nature, often exhibit repetitive patterns in their language and structure, extra care was taken in creating the test set. Specifically, the Jaccard similarity score was computed for each pair of sentences from the train and test sets. Jaccard similarity is defined as

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$

where A and B are the sets of tokens in two instances; it measures the degree of similarity between two sets. Any sentence in the test set exhibiting a Jaccard similarity score above 50% with any sentence in the training set was replaced by a different sentence from the train set. This approach helps to reduce data leakage and ensures that the test set provides a robust and unbiased evaluation of model performance.

3.2.4 Data Collection and Annotation

Table 3.2: Inter-annotator micro F1 scores. Annotators A and B are engineers familiar with the NER/RE task. Annotator C had no prior familiarity with the NER/RE task nor any expertise in engineering, finance, or linguistics.

Annotator Pair | Entity F1 (%) | Relation F1 (%)
Main Annotator and A | 78.29 | 59.72
Main Annotator and B | 70.57 | 49.19
Main Annotator and C | 50.46 | 16.05
A and B | 69.73 | 48.46
A and C | 46.72 | 14.19
B and C | 49.52 | 17.49

Data Sources and Pre-Processing.
Approximately 25% of the dataset’s records were sourced from publicly accessible financial news articles [11, 133, 26, 119, 120], while the remaining 75% were extracted from publicly available SEC filings such as 10-K and 10-Q financial reports. For the SEC filings, we used the dataset of Cleaned and Raw 10-X Files spanning the years 1993-2021 [84]. This dataset contains all 10-K variants, e.g., 10-Q, 10-K/A, 10-K405. Every report in this dataset has already been cleaned and parsed to remove all non-textual related objects. For the financial news pieces, we obtained the original articles directly from their respective sources and manually conducted the cleaning process to extract the raw text. Data Collection and Labeling. The process began by selecting a subset of financial reports and articles, as shown in Figure 3.3. An annotator identified and labeled key sentences with relevant entities and relations, creating a “seed” dataset. This dataset trained a joint NER and RE model (refer to 3.2.6), which then scanned new documents to suggest potential sentences. However, only the sentence selection was automated; actual labeling was always done manually. To mitigate confirmation bias, selections were deliberately made from low-confidence predictions generated by the model. Also, to reduce bias, the annotator was not shown the model’s predictions. This cycle continued until we achieved the desired dataset size, with all annotations done by a single non-domain expert human annotator, who is also the lead author of this work. Annotation Guidelines. For the FIRE dataset, a comprehensive set of labeling rules was established, incorporating both general entity and term annotation guidelines based on the ACL RD-TEC guidelines [96], as well as domain-specific rules tailored to each entity and relation present in the dataset. The guidelines also provide guidance for resolving ambiguous or conflicting edge cases. Inter-Annotator Agreement. To assess difficulty of the annotation task, a subset of 150 samples was randomly selected and provided to three independent annotators. Annotators A and 46 B were engineers with familiarity with the NER/RE task and annotator C was a professor with expertise outside of finance, engineering, and linguistics. Annotator A underwent several iterations of training to improve the quality of their annotations. In contrast, Annotators B and C were instructed to familiarize themselves with the annotation guidelines for 1-2 hours before starting the labeling task, without any prior training. The agreement between the annotators, including the main annotator of the dataset, was measured using the pair-wise entity and relations micro F1 score, as detailed in Table 3.2. This score was computed by treating one set of annotations as the ground truth labels and the other as predictions. Note that the result is the same regardless of which annotations were designated as ground truth. Although Cohen’s Kappa is usually the preferred metric for inter-annotator agreement, it is not suitable for the NER/RE task [29, 57]. The highest agreement was found with the annotator who received additional training. There was also greater agreement between the main annotator and annotator B as compared to annotator C, likely due to the annotator’s technical background and familiarity with the NER/RE task. 
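The pair-wise agreement scores in Table 3.2 amount to an exact-match micro F1 between two annotators' label sets; the sketch below illustrates the computation (the tuple encoding of annotations is an illustrative choice). Because the micro F1 of two sets is symmetric, the result is the same regardless of which annotator is treated as ground truth.

def micro_f1(set_a, set_b):
    """Exact-match micro F1 between two sets of annotations, e.g. tuples of
    (sentence_id, start, end, entity_type) for entities, or
    (sentence_id, head_span, tail_span, relation_type) for relations."""
    tp = len(set_a & set_b)
    fp = len(set_b - set_a)
    fn = len(set_a - set_b)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# toy example with two annotators' entity annotations
ann_main = {(0, 5, 9, "Company"), (0, 23, 27, "Date"), (1, 0, 4, "Company")}
ann_a = {(0, 5, 9, "Company"), (1, 0, 4, "Sector")}
print(micro_f1(ann_main, ann_a))   # 2 * TP / (|A| + |B|) = 2 * 1 / (3 + 2) = 0.4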
These results suggest that the task has a high level of technical complexity and that, even with the detailed annotation guidelines, training of new annotators requires an iterative education process. Furthermore, even with some iteration in annotator training, as was the case for annotator A, the inter-annotator agreement indicates significant room for improvement. For this reason, the entire FIRE dataset is labeled by a single annotator who wrote the annotation guidelines and invested significant time and effort to ensure consistency. None of the results collected by the other annotators for the inter-annotator agreement study are contained in the final dataset. The consistent labeling of the FIRE dataset is confirmed by the results in Section 3.2.6, where the F1 scores for trained models are much higher than the figures in Table 3.2.

3.2.5 Labeling Tool

We introduce an open-source, web-based text annotation tool alongside the FIRE dataset (https://github.com/abhinav-kumar-thakur/relation-extraction-annotator). Tailored for entity and relation labeling, the tool offers features for efficient annotation and error minimization. It supports shortcuts for quick labeling and an optional rules-file upload to set constraints on permissible relations between entity types, inspired by the work of [80]. For example, in FIRE, a rule might dictate that the ActionSell relation is exclusive to the Company entity type. This ensures accurate annotations by preventing incompatible entity-relation combinations. The tool also logs the annotation time for each instance, as detailed in Section 3.2.1.

3.2.6 Experimental Results

Algorithm 1: A Simple Curriculum Learning Algorithm
Data: Dataset D, difficulty metric M, number of tiers N, number of fine-tuning epochs E
Result: Trained model Θ
1. Divide D into N tiers (T1, T2, ..., TN) in increasing order of difficulty based on metric M
2. D_current = ∅
3. for i = 1 to N do
4.     D_current = D_current ∪ T_i
5.     Train on D_current for one epoch
6. Fine-tune on the entire dataset D for E epochs
7. return trained model Θ
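In Python, the cumulative loop of Algorithm 1 can be sketched as follows; model, train_one_epoch, and the difficulty metric are placeholders for whichever trainer and metric (for example, the labeling-time-based metric described later) are used.

from typing import Callable, List, Sequence

def curriculum_train(
    model,
    dataset: Sequence,
    difficulty: Callable[[object], float],     # metric M: larger means harder
    train_one_epoch: Callable[[object, List], None],
    n_tiers: int = 10,                         # N
    finetune_epochs: int = 20,                 # E
):
    # Step 1: sort by difficulty and split into N tiers of (roughly) equal size
    ordered = sorted(dataset, key=difficulty)
    tier_size = max(1, len(ordered) // n_tiers)
    tiers = [ordered[i * tier_size:(i + 1) * tier_size] for i in range(n_tiers - 1)]
    tiers.append(ordered[(n_tiers - 1) * tier_size:])   # last tier takes the remainder

    # Steps 2-5: train one epoch on the cumulative union of tiers, easiest first
    current: List = []
    for tier in tiers:
        current.extend(tier)
        train_one_epoch(model, current)

    # Step 6: fine-tune on the full dataset for E epochs
    for _ in range(finetune_epochs):
        train_one_epoch(model, list(dataset))
    return model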
Models. To benchmark the performance of state-of-the-art models on FIRE, two families of models were selected for evaluation: RE-specialized models and general-purpose generative (causal) LLMs. RE-specialized models are models that were designed specifically to solve the RE, and possibly the NER, task. These models are usually built on top of a pre-trained base model such as BERT [36]. They are then customized to target the RE task through a combination of building a custom architecture, applying RE-specific data pre-processing, and customizing the training procedure. On the other hand, general-purpose causal LLMs are designed with the language modeling objective and have no direct connection to the RE task. They can still be evaluated on this task by treating it as a sequence generation problem. Three RE-specialized models were selected: SpERT [39], PL-Marker [137], and REBEL [17]. SpERT effectively applies the Transformer architecture, complemented by a robust negative sampling strategy, and thus serves as a good starting point for evaluation. PL-Marker employs a unique marker mechanism to mark entity boundaries in sentences. Both models are built on top of the BERT architecture [36]. REBEL, on the other hand, is a sequence-to-sequence language model built on top of the BART architecture [72]. REBEL treats relation extraction as a language generation task by expressing the triplet targets as a sequence of text. This provides an alternative perspective on the problem. Note that REBEL does not evaluate on entities. For general-purpose generative models, we opted for Llama 2-7b [122] and GPT-3.5 [14], evaluating them in both few-shot and fine-tuned settings. Together, these models provide a reasonably comprehensive assessment of the FIRE dataset's performance and potential.

Table 3.3: Performance of all models on the FIRE test data. The mean and standard deviation of the micro F1 score are reported for both entities and relations. SpERT, PL-Marker, and REBEL are evaluated in two settings: standard fine-tuning and curriculum learning. Llama 2-7b and GPT 3.5 are evaluated in a few-shot setting as well as in a standard fine-tuning setting. Note that the REBEL model does not compute entity metrics.

Model Class | Model | Evaluation | Entity F1 (%) | Relation F1 (%)
RE-specialized models | SpERT | Standard Fine-Tuning | 84.63 ± 0.25 | 67.41 ± 0.92
RE-specialized models | SpERT | Curriculum Learning | 85.39 ± 0.33 | 68.11 ± 0.53
RE-specialized models | PL-Marker | Standard Fine-Tuning | 83.78 ± 0.18 | 67.01 ± 0.67
RE-specialized models | PL-Marker | Curriculum Learning | 84.65 ± 0.54 | 67.67 ± 0.82
RE-specialized models | REBEL | Standard Fine-Tuning | - | 68.25 ± 0.44
RE-specialized models | REBEL | Curriculum Learning | - | 68.93 ± 0.52
General-purpose models | Llama 2-7b | Few-Shot | 20.24 ± 1.60 | 9.32 ± 1.27
General-purpose models | Llama 2-7b | Standard Fine-Tuning | 64.89 ± 1.10 | 36.70 ± 0.59
General-purpose models | GPT 3.5 | Few-Shot | 56.68 ± 1.06 | 16.50 ± 0.39
General-purpose models | GPT 3.5 | Standard Fine-Tuning | 81.48 ± 0.18 | 57.50 ± 1.57

Setup and Evaluation

Standard Fine-Tuning. SpERT, PL-Marker, and REBEL were each allotted 24 hours on an Nvidia GeForce RTX 2080 Ti GPU for hyper-parameter tuning on the validation set to find the best learning rate and batch size. The best-performing model is then evaluated on the test set. More details can be found in Appendix B.2. Llama 2-7b and GPT-3.5 were fine-tuned with a custom prompt (Appendix B.3) and without hyper-parameter tuning due to computational constraints. Llama 2-7b underwent fine-tuning using QLoRA [33] based parameter-efficient techniques with a 4-bit configuration. For GPT 3.5, the fine-tuning is performed using the API provided by OpenAI [90]. Fine-tuning and evaluations are done using an Nvidia GeForce RTX 4060 Ti GPU, with around $100 spent on OpenAI APIs.

Few-Shot Prompting. For Llama 2-7b and GPT 3.5, a custom prompt was designed to evaluate both models in a few-shot setting. The prompt includes a definition and description of each relation type. For each iteration, the few-shot examples are randomly selected from the training set of the dataset. The models are then prompted to extract both entities and relations. Prompt details are in Appendix B.3.

Curriculum Learning. In addition to the standard training setup, another experiment was performed by training the three RE-specialized models according to a curriculum determined by the labeling time information. A very simple curriculum learning algorithm is used, as described in Algorithm 1. The training set is first divided into N tiers in increasing order of difficulty according to a metric M. Then, the model is trained successively for one epoch on each tier, together with all previous tiers. Finally, the model is fine-tuned on the entire dataset for E epochs. In our experiments, we set N = 10 and E = 20 for all models. A compute budget of 24 hours is again given to each model to search for the best learning rate and batch size. The difficulty metric M was computed as follows: given a sentence's labeling time t, we consider the following features: the number of entities n_ent, the number of relations n_rel, and boolean variables indicating the length of the sentence as either short or medium, with long sentences encoded by setting both the short and medium variables to zero. Using these features, we fit a simple linear regression model to predict t as:

$$ \hat{t} = \beta_0 + \beta_1 \cdot n_{\text{ent}} + \beta_2 \cdot n_{\text{rel}} \tag{3.3} $$
$$ \qquad\;\; + \beta_3 \cdot \text{short} + \beta_4 \cdot \text{medium} \tag{3.4} $$

The difficulty metric M is then defined as the normalized residual of the actual and predicted labeling time:

$$ M = \frac{t - \hat{t}}{\max(t) - \min(t)} \tag{3.5} $$

This metric gives us a sense of how much harder (or easier) a sentence is to label compared to what we would expect (from t̂) based solely on its features. Intuitively, a sentence whose expected labeling time t̂ is larger than its actual labeling time t may be an "easy" sentence, and vice versa. The reason M is not simply chosen to be the labeling time t is that a sentence with large t is not always "more difficult" to label than a sentence with smaller t. The difference could be due to the features discussed above; e.g., a sentence with large t could simply contain more entities but actually be easier to label than another sentence with smaller t. This is why proper normalization is required when choosing M.
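A small sketch of this computation follows; the feature encoding and the short/medium length cut-offs are illustrative choices rather than the exact values used for FIRE.

import numpy as np

def difficulty_scores(times, n_entities, n_relations, n_tokens,
                      short_cutoff=20, medium_cutoff=40):
    t = np.asarray(times, dtype=float)
    n_tok = np.asarray(n_tokens)
    short = (n_tok < short_cutoff).astype(float)
    medium = ((n_tok >= short_cutoff) & (n_tok < medium_cutoff)).astype(float)
    # design matrix [1, n_ent, n_rel, short, medium] for the fit in (3.3)-(3.4)
    X = np.column_stack([np.ones_like(t), n_entities, n_relations, short, medium])
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)
    t_hat = X @ beta
    # (3.5): normalized residual; positive values mean "harder than its features suggest"
    return (t - t_hat) / (t.max() - t.min())

scores = difficulty_scores(
    times=[12.0, 45.0, 30.0, 90.0, 25.0, 60.0],
    n_entities=[2, 5, 3, 6, 3, 4],
    n_relations=[1, 3, 2, 4, 1, 3],
    n_tokens=[15, 35, 25, 60, 22, 48],
)
print(scores)   # one residual-based difficulty score per sentence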
Evaluation. For each experiment category, three independent training runs were performed. The mean and standard deviation of the micro F1 score are reported. The exact-match micro F1 score was used as the evaluation metric for relations, i.e. the entity boundaries, entity types, and the relation label must all exactly match the ground-truth labels to be considered correct. We use the train/eval/test splits for FIRE as reported in Section 3.2.3.

Figure 3.4: Comparison of F1 scores across each relation type in FIRE between standard fine-tuning and curriculum learning approaches using the SpERT model. The results highlight varying levels of difficulty in relation detection and may suggest an improvement in challenging relations when employing curriculum learning.

Results. Table 3.3 presents the results of all experiments. The three RE-specialized models display comparable performance and significantly outperform the inter-annotator agreement scores in Table 3.2, further indicating the consistent annotations in the dataset. Looking into the curriculum learning results, we see that curriculum learning enhanced the performance of all three models compared to standard training. This confirms our assumption that the labeling time is an informative feature that can be used to improve the generalization capabilities of the models. Table 3.3 also showcases the results for general-purpose generative LLMs. Fine-tuning outperforms few-shot learning significantly. GPT-3.5 surpasses Llama 2-7b, especially when fine-tuned. However, these models still lag behind the RE-specialized models. Our findings are consistent with a recent study [47] that also identified a significant performance gap between ChatGPT [92] and state-of-the-art methods, particularly in more complex tasks. This can be explained by multiple factors, mainly the difficulty of performing strict evaluation of generative models, which lack a fixed output format. This underscores the need for further research on using untrained causal LLMs for relation extraction, especially on datasets with diverse entity and relation types.
Figure 3.4 compares the F1 scores per relation type for the SpERT model trained with standard fine-tuning versus curriculum learning. The performance patterns of the two techniques are generally similar: both training methods exhibit difficulties with the same relation types and perform better on others. This pattern cannot be attributed solely to class imbalances. Rather, it seems to arise from the complexity inherent in detecting certain relations. For instance, "ValueChangeDecreaseBy" is infrequent within the dataset (refer to Table B.2 in Appendix B.1), yet the model demonstrates strong performance, likely due to the straightforward nature of detecting a relation involving a monetary value. On the other hand, "PropertyOf" appears more frequently, but the model struggles to extract this relation, potentially because of the complexity of establishing this relation between two entities. Importantly, curriculum learning appears to enhance model performance on relation types that have lower F1 scores with standard fine-tuning, such as "ConstituentOf", "ProductOf", and "PropertyOf". This suggests that curriculum learning may improve model performance on more complex relations. However, further analysis is necessary to determine whether this improvement is consistent across various models and random seeds. Note that the labeling time feature is a sentence-level metric and not a relation-level metric. Therefore, a direct comparison between labeling time and per-relation score is not possible. Finally, while we employed a very simple curriculum learning algorithm, more advanced and sophisticated techniques have been proposed in the literature that can potentially achieve even higher improvements. Nevertheless, our primary contribution focuses on the dataset, and a thorough evaluation of all curriculum learning techniques can be explored in future research.

3.3 Preliminary Work: Synthetic Data for NER

3.3.1 Synthetic Data and Knowledge Distillation

In the data-bound regime, where acquiring high-quality annotated data is costly and labor-intensive, one solution is to generate synthetic data to supplement the already labeled datasets. Advanced large language models (LLMs) can generate synthetic data that mimics real-world examples, providing a valuable resource for training smaller, specialized models. However, the question arises: if LLMs are so advanced, why not use them directly for tasks like NER or RE? Despite their capabilities, LLMs still struggle with tasks requiring structured outputs, such as information extraction. While LLMs excel at extracting broad concepts, they are less capable of consistently identifying the specific entity types necessary for high performance in these tasks. This can be attributed to multiple reasons. First, while an LLM may easily identify generic entity or relation labels, it can struggle to identify more domain-specific labels, such as those found in a biomedical or scientific dataset. Second, since LLMs are usually expressive, it is difficult to evaluate their outputs against a reference labeled dataset. Furthermore, even if LLMs were to achieve state-of-the-art performance on these tasks, there are still compelling reasons to explore the use of synthetic data. Firstly, synthetic data can be used to train smaller models that are more efficient in terms of memory usage and compute during inference, which reduces costs and improves response times.
Secondly, the use of synthetic data enables the development of open-source alternatives to proprietary models like GPT-4. This not only ensures independence and control over the models but also mitigates the risk of losing access to proprietary technologies. 54 The process of using synthetic data from an LLM to train a smaller model can be viewed as a form of knowledge distillation. Knowledge distillation involves transferring the knowledge from a large, complex model (teacher model) to a smaller, more efficient one (student model). This is typically achieved by training the smaller model to replicate the output distributions of the larger model, rather than just mimicking its final predictions. The technique uses the soft output of the teacher model as a source of information, in order to guide the smaller model to learn both the explicit and implicit patterns recognized by its larger counterpart. 3.3.2 The Partial Annotation Problem In our work, we will use one-shot prompting to generate synthetic data for NER. The LLM will be prompted by instructions and one labeled example from the training data. The LLM should then generate similar synthetic examples that contain new named entities, clearly marked in the text. By mirroring the distribution of the training data, this approach ensures that the newly generated named entities closely resemble the originally labeled ones, thereby increasing the likelihood of producing high-quality examples. Below is an example of the prompt used to generate synthetic data. Note that the “Example Input” are examples obtained from the labeled training dataset. Instructions: - You are given a JSON object that includes a sentence and some named entities occurring in the sentence alongside their types. - Propose new entities, and generate a sentence that includes the chosen entities. - Important: The proposed entities and sentence should follow the same domain and context as the given sentence. - The sentence should be grammatically correct and contextually coherent. - Format your response in JSON format following the input example. 55 - Only respond with the generated JSON object. Example Input: { "entity_1": "John", "type_1": "People", "entity_2": "Google", "type_2": "Organization", "generated_sentence": "John has been with Google for the past five years." } Full Annotation vs. Partial Annotation Problem. Given the above approach, a significant challenge arises when the LLM outputs sentences that include entities which are not labeled. This situation leads to partially annotated data. For instance, consider below an example of an LLM output following the prompt used above: LLM Output: { "entity_1": "Mary", "type_1": "People", "entity_2": "IBM", "type_2": "Organization", "generated_sentence": "Mary works for IBM in New York." } In this example, while the entities “Mary” and “IBM” are correctly used and labeled, the entity “New York” remains unlabeled. This creates a partially annotated example where the sentence is 56 coherent and relevant but contains unlabeled entities that can introduce noise into the training data. Training an NER model on this data could lead to sub-optimal performance, as the model might be penalized for correctly identifying unlabeled entities in text, such as “New York". The objective of this work is to develop a technique that allows for effective training on partially annotated synthetic data, thereby minimizing or eliminating the impact of this noise. 
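For reference, the generation step itself amounts to a simple API call built around the prompt shown above. The snippet below is a minimal sketch assuming the OpenAI Python client; the model name, temperature, and prompt assembly are illustrative choices, and the equivalent Anthropic client would be used for Claude.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_synthetic_example(instructions: str, seed_example: dict) -> dict:
    """One-shot prompting: the instructions plus one labeled training example
    are sent to the LLM, which returns a new synthetic example as JSON."""
    prompt = instructions + "\n\nExample Input:\n" + json.dumps(seed_example, indent=2)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",                       # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,                             # illustrative setting
    )
    return json.loads(response.choices[0].message.content)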
In the next section, we will detail our proposed method and then present some experimental results.

3.3.3 Mask the Unknown Tags

Consider the following setup for NER:
• Use the sequence labeling formulation of NER, which consists of the Beginning, Inside, and Outside (BIO) tags.
• The input to our task is the sentence, while the output is the same sentence with BIO tags inserted after each token.
• Train a sequence-to-sequence language model according to the proposed inputs and outputs.

Our novel approach, termed "Mask the Unknown Tags," adapts the sequence labeling framework to synthetic data by selectively masking the LLM outputs. The methodology is straightforward: we apply a mask over the LLM-generated sequence such that it is only active over 'O' tags, that is, tokens for which we lack confidence in the LLM's labeling. The mask remains inactive on tokens from the original, trusted text, as well as on 'B' and 'I' tagged tokens that are confidently labeled. This targeted masking ensures that, during training, the loss is only calculated over the unmasked, reliably labeled tokens.

Sentence Input:   Mary B-Peop works O for O IBM B-Org in O New O York O
Reference Output: Mary B-Peop works O for O IBM B-Org in O New O York O
Loss Mask:        T T T F T F T T T F T F T F

In this scheme, 'T' denotes tokens over which the training loss is calculated, and 'F' indicates masked tokens. The synthetic sentence "Mary works for IBM in New York" is partially labeled by the LLM, with "Mary" and "IBM" correctly tagged but "New York" omitted. Our mask allows the model to learn from "Mary" and "IBM" while not penalizing the model for any predictions it makes about "New York." While this approach ensures the model's learning is free of this noise, it does have a drawback: reduced token efficiency. In any given synthetic sentence, only a portion of the tokens contribute to learning, while the remainder, though potentially informative, are disregarded. This trade-off is considered acceptable to ensure that learning is driven by high-quality labeled data.
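A minimal sketch of how this mask can be realized in practice is shown below, assuming the common Hugging Face convention of setting masked label positions to -100 so that they are ignored by the cross-entropy loss. The whitespace-level alignment is illustrative; a real implementation would additionally propagate the mask through subword tokenization.

def mask_unknown_tags(target_tokens, target_token_ids, ignore_index=-100):
    """Build label ids for a seq2seq NER target of the form 'Mary B-Peop works O ...',
    masking the loss on 'O' tags.

    target_tokens:    whitespace-level tokens of the reference output
    target_token_ids: the corresponding ids that would be fed to the model as labels
    """
    labels = []
    for tok, tok_id in zip(target_tokens, target_token_ids):
        if tok == "O":
            labels.append(ignore_index)  # possibly an unlabeled entity: do not penalize
        else:
            labels.append(tok_id)        # words and B-/I- tags are trusted: compute loss
    return labels

# Example (token ids are illustrative placeholders):
target = "Mary B-Peop works O for O IBM B-Org in O New O York O".split()
ids = list(range(len(target)))
print(mask_unknown_tags(target, ids))
# -> [0, 1, 2, -100, 4, -100, 6, 7, 8, -100, 10, -100, 12, -100]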
3.3.4 Experimental Results

In this section, we present the experimental results of using our "Mask the Unknown Tags" approach.

Datasets. To evaluate the performance of our proposed technique, two datasets were selected: a general-domain dataset and a scientific dataset. For the first set of experiments, we used the CONLL04 dataset [101], which contains 922 training sentences with four types of entities and five types of relations, drawn from news reports. We expect synthetic data in this case to provide large improvements over the baseline, as no specialized knowledge is required to generate data that resembles excerpts from news reports. On the other hand, the second dataset we evaluate on is the SciERC dataset [79], consisting of 1861 training sentences from scientific abstracts, with six entity types and seven relation types. In this case, we expect the task of generating synthetic data resembling the training distribution to be more challenging for the LLM, considering that domain-specific knowledge is required to understand and generate scientific entities.

LLM for Synthetic Data. To generate the synthetic dataset, two recent LLMs were selected: GPT 3.5 [92] and Claude Sonnet†. Both models are closed-source, so detailed information regarding their architectures, model sizes, and training data is not available. Each model provider has an API that is used to query these models. These models were selected due to their high performance on various NLP benchmarks, surpassing their counterparts [50]. Note that Claude Sonnet's benchmark results are higher than those of GPT 3.5, and so one of the experimental observations will be comparing the quality and impact of synthetic data generated from LLMs with different strength levels.

† https://www.anthropic.com/news/claude-3-family

Baseline Model for Training. Since we formulated the NER problem as a sequence labeling task, a sequence-to-sequence model, Flan-T5-Base [25], is selected as the baseline model to evaluate our approach on. Flan-T5 is a variant of the T5 (Text-to-Text Transfer Transformer) models developed by Google [97]. Flan-T5 is specifically fine-tuned with a technique called "instruction tuning," which trains the model on a diverse set of tasks described in natural language and helps the model better follow human-like instructions. Flan-T5-Base is one model from this family, with 250 million parameters. This model size is estimated to be two orders of magnitude lower than the GPT 3.5 and Claude Sonnet models we use for synthetic data generation (information regarding the exact sizes of these models is not available). This fact, alongside Flan-T5 being an open-source model, again highlights the motivations behind this framework.

Synthetic Source Model   Training Dataset   10%          40%          70%          100%
N/A                      Annotations        0.24±0.23    68.89±1.01   84.59±0.09   87.16±0.96
GPT-3.5 Turbo            + Synthetic        44.25±2.18   82.11±0.35   85.97±0.24   -
GPT-3.5 Turbo            + Masking          48.34±0.41   83.61±0.40   87.58±0.60   88.46±0.20
Claude Sonnet            + Synthetic        28.52±2.32   81.64±0.23   85.89±0.45   -
Claude Sonnet            + Masking          49.34±0.20   83.10±0.38   87.59±0.42   88.36±0.43

Table 3.4: NER performance on the CoNLL04 dataset (micro F1, %).

Experimental Setup. Extensive evaluations are conducted to showcase the value of our approach. First, for each labeled training dataset, two synthetic datasets were generated, one using GPT 3.5 and one using Claude Sonnet. These synthetic datasets are approximately the same size as the original training data. Second, our experiments were conducted with different ratios of the training data available, specifically 10%, 40%, 70% and 100%. This is to showcase the effect of synthetic data at varying levels of available data. All training runs used the same hyper-parameters: 10 epochs, a batch size of 4, and a learning rate of 5e-5. No hyper-parameter tuning was performed. Three random iterations are run for each training setup. We report the mean and standard deviation of the micro F1 score on the fixed test dataset.

Synthetic Source Model   Training Dataset   10%          40%          70%          100%
N/A                      Annotations        0±0          50.26±1.10   62.47±0.60   65.17±0.18
GPT-3.5 Turbo            + Synthetic        26.44±0.24   53.08±1.05   60.19±0.09   62.92±1.06
GPT-3.5 Turbo            + Masking          30.50±0.79   57.68±0.36   63.23±0.34   64.00±0.43
Claude Sonnet            + Synthetic        24.78±0.73   54.39±0.72   60.25±0.49   62.53±0.41
Claude Sonnet            + Masking          35.09±3.92   60.31±0.72   63.58±0.43   65.23±0.59

Table 3.5: NER performance on the SciERC dataset (micro F1, %).

Results. Table 3.4 and Table 3.5 present the results of our experiments on the CONLL04 and SciERC datasets, respectively. The first row of each table is the baseline, i.e., training Flan-T5 with the labeled training data only. The rows marked "+ Synthetic" are the results of training Flan-T5 with the labeled dataset and the synthetic dataset but without applying the proposed masking approach.
These rows serve as an ablation study, showcasing the exact improvement the masking approach can provide, aside from the improvement already provided by the synthetic data. Three conclusions can be made from observing the results on both datasets. First, synthetic data alone improves the performance against the baseline by a large margin in lower-resource settings (10% and 40% of the training data). This increase in F1 score drops when more training data is available. Second, and most importantly, the masking approach consistently achieves higher F1 scores than directly using synthetic data, with the gain margin again decreasing as more data is available. This observation validates our assumptions regarding synthetic data noise and clearly showcases the benefit of out proposed method. Finally, one last observation is how the synthetic data from the Claude Sonnet models appears to increase performance larger than the data generated from GPT 3.5. The difference is also most noticeable on the SciERC dataset, a specialized scientific dataset. This evidence suggests that powerful models for synthetic data generation are especially needed for more complex and domain-specific datasets. 3.4 ToolCritic: Improving LLM Tool-Use in Multi-Turn Dialogue Applications 3.4.1 Background and Related Work The integration of tools with large language models (LLMs), also referred to as function-calling, has gained significant attention in recent years, driven by the enhanced reasoning capabilities of these models. The basic concept involves providing LLMs with a specification of functions—for example included in the system prompt—allowing the LLM to invoke a tool by outputting specific keywords when necessary. Since LLMs are primarily pretrained on language modeling tasks, additional training 61 Figure 3.5: ToolCritic evaluates every assistant response in a conversation, whether a tool was called or not (this is not shown in the diagram for space limitations). If ToolCritic detects an assistant error, it will produce a reasoning “thought” which will be fed back to the assistant. The assistant then will revise its own response based on the received feedback and produce an updated response. If ToolCritic does not detect a mistake, the conversation continues without interference. or fine-tuning is helpful to improve their function-calling abilities. Toolformer [106] employs a self-supervised training approach to enhance a model’s proficiency in tool invocation. In [94], the authors developed a large-scale dataset of user prompts and API calls, followed by self-instruct fine-tuning of a LLaMA-7B-based model, which significantly improved API invocation accuracy. What distinguishes our work from previous studies is our focus on tool usage in multi-turn dialogues, as opposed to single-turn question-answering. Multi-turn dialogues present a more complex challenge, requiring the LLM to invoke different tools at different turns, track user intent throughout the dialogue, interact with users to gather additional information or confirmations, and manage a dynamic conversational flow. LLMs are prone to reasoning errors and hallucinations, and these issues are often exacerbated in tool-calling scenarios. While self-correction techniques [82, 109]—where an LLM is provided with its own response for improvement—have shown potential, they have also faced criticism that raises doubts about their effectiveness [58, 123]. In our experiments, we observe that LLMs struggle to identify their own mistakes when interacting with tools. 
Our findings support this prior research, which suggests that external feedback is crucial for correcting such errors. This conclusion has also been supported by newer approaches. For instance, in [83], the authors address the challenge of detecting bugs in LLM-generated code by training a GPT-4 model via reinforcement learning with human feedback (RLHF) on a dataset of user questions and buggy code. This model demonstrated significant improvements in bug detection compared to having the base GPT-4 model detect its own bugs. Another approach, the CRITIC framework [45], involves LLMs using external tools to receive feedback for self-correction, such as performing a Google search for fact-checking. Unlike these studies, our work focuses more broadly on enhancing the fundamental tool-calling ability of LLMs within the context of conversational dialogue by utilizing a single diagnostic model.

Motivated by the finding that LLMs are not proficient at self-correction, we propose ToolCritic: a model that detects errors in LLM tool usage during multi-turn task-oriented dialogues (TODs) and provides descriptive feedback that the LLM can use to revise and improve its responses (see examples in Figure 3.5). Our results confirm that self-correction is not viable in this context and may even degrade performance. Unlike previous work that broadly examines LLM reasoning errors or focuses on the function-calling ability of LLMs in isolation, we specifically investigate LLM function-calling in conversational dialogue. Our contributions can be summarized as follows:

• We identify and categorize eight different types of mistakes that occur when LLMs interact with tools in multi-turn dialogues. We then build a synthetic dataset of dialogues with intentionally introduced mistakes from all eight categories.
• We propose ToolCritic: a real-time diagnostic model that can identify, classify, and provide descriptive feedback for the LLM-as-Tool architecture in conversational dialogue.
• Finally, we demonstrate the utility of ToolCritic on real-world LLM errors using public datasets, showing up to a 13% improvement in dialogue success rates on the Schema Guided Dialogue (SGD) dataset [98] with both closed- and open-sourced models compared to baseline and self-correction techniques.

Figure 3.6: Each error category defines a specific mistake, or failure mode, that is common for LLMs when interacting with tools in multi-turn dialogue. This granularity ensures the resulting diagnostic model, ToolCritic, can provide very specific feedback for the assistant LLM.

A closely related work is T-Eval [22], which evaluates the tool utilization capabilities of LLMs by dividing the task into multiple sub-processes and assessing different models using a constructed dataset. However, T-Eval is purely an evaluation study, whereas our work aims to develop a model that can automatically evaluate and correct tool usage errors in real time. To investigate this problem, it is essential to examine datasets of conversational dialogues that involve tool-calling, particularly those focused on task-oriented dialogues. Examples include MultiWOZ [15], SGD [98], and TaskMaster [16]. While these datasets simulate tool-calls through mock interfaces, others, such as API-Bank [74], ToolTalk [41], and Lucid [112], implement actual tool-calls within the dialogues, for example via a Python backend. For our study, we focus on the SGD dataset due to its large scale and the diverse range of tools it includes.
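The overall feedback loop can be summarized by the following minimal sketch. The assistant_llm and toolcritic objects and their method names are illustrative placeholders for the actual model interfaces; the sketch only captures the single round of revision described above.

def respond_with_toolcritic(assistant_llm, toolcritic, dialogue_history, tools):
    """Produce one assistant turn with ToolCritic in the loop (at most one revision round)."""
    # 1. The assistant produces its response, possibly including a tool call.
    response = assistant_llm.generate(dialogue_history, tools)
    # 2. ToolCritic inspects the turn and returns an error category plus a reasoning
    #    "thought", or indicates that no error was detected.
    verdict = toolcritic.evaluate(dialogue_history, tools, response)
    if verdict.error_detected:
        # 3. The reasoning is fed back to the assistant, which revises its response once.
        response = assistant_llm.generate(dialogue_history, tools, feedback=verdict.reasoning)
    return response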
3.4.2 Dataset Definition Error Categories Definition 64 The initial step in constructing our diagnostic system involves identifying the various types of mistakes that an LLM can make when interacting with tools during conversation. From our literature review, we identified that errors generally stem from issues in instruction following, reasoning, and planning. To achieve a more precise and effective diagnostic model, we further divided these errors into eight distinct categories, as illustrated in Figure 3.6. This level of granularity allows the model to diagnose errors more accurately and provides valuable specificity for subsequent self-correction. For instance, consider the “required arguments” and “optional arguments” error categories. Although these could be grouped together, distinguishing between them is critical due to the unique nature of each error type. An error in a required argument typically indicates a failure to follow the user’s request or a misalignment between the user-provided values and the function’s arguments. Conversely, an error involving an optional argument can occur for different reasons, such as the LLM mistakenly omitting an optional argument or adding an unnecessary one. By distinguishing between these error types, we enable the model to better understand the nuances of each failure mode, leading to more effective error detection and correction. Synthetically-Modified Error Dataset After defining the error types, the next step is to create a dataset where the inputs consist of conversations between a user and an assistant, with the assistant making one of the identified errors in tool usage. The output labels are twofold: first, the error category, and second, a reasoning “thought" that explains why the assistant’s response constitutes an error and how it can be detected from the perspective of a third party outside the dialogue. This “thought" is what will be used in fine-tuning to produce the diagnostic model, ToolCritic. Given that the effectiveness of ToolCritic is dependent on the dataset, we selected the SchemaGuided Dialogue (SGD) dataset [98] for this study. We began by manually crafting five to seven examples for each error category, resulting in a total of 46 hand-written examples. These examples 65 were carefully designed to span various domains and encompass different failure scenarios within each error category. Using these demonstration examples, we employed the Claude 3.5 Sonnet model with few-shot prompting to generate a comprehensive synthetic dataset. For each error category, all examples were provided to the Claude model in a random order at each turn, along with an additional hint to guide the model’s output. Claude is then given a new error-free example from the SGD dataset and is instructed to insert an error (mistake) somewhere in the conversation. We repeat this process for each error category. This approach allowed us to generate 300 data points per error category, resulting in a dataset of 2, 400 data points. Details explaining this data generation process and the prompt templates used are provided in Appendix D.1. 3.4.3 ToolCritic Fine-Tuning Setup To develop our diagnostic model, we fine-tune an open-source language model, specifically LLama 3.1 8B instruct, on the synthetic dataset we generated, which contains dialogues with intentional errors, as well as an equally sized set of “clean” dialogues randomly sampled from the SGD dataset. 
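A minimal sketch of this generation loop is shown below, assuming the Anthropic Python SDK. The model identifier, prompt wording, and expected JSON fields are illustrative; the actual prompt templates are the ones documented in Appendix D.1.

import json
import random
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def inject_error(clean_dialogue, category, demonstrations, hint):
    """Ask Claude to insert one mistake of the given category into an error-free SGD dialogue."""
    random.shuffle(demonstrations)  # demonstrations are presented in a random order on each call
    prompt = (
        f"Error category: {category}\nHint: {hint}\n\n"
        "Demonstration examples:\n"
        + "\n".join(json.dumps(d) for d in demonstrations)
        + "\n\nInsert a mistake of this category somewhere in the following dialogue. "
        "Return the modified dialogue, the error category, and a reasoning 'thought' as JSON.\n"
        + json.dumps(clean_dialogue)
    )
    reply = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model identifier
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.content[0].text)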
The dataset is split into training, validation, and test sets, with a 70%, 15%, 15% distribution. This results in a training dataset comprising 225 examples for each error category and an equal number of clean examples, totaling 3, 600 dialogues. During training, for each data point in the synthetic error dataset, the dialogue up to the turn containing the LLM error is fed into the model. The model is then trained to identify the error category and generate a reasoning "thought" explaining why the error occurred. For the clean portion of the training data, a subset of dialogue turns is randomly sampled from each dialogue, ranging from the first turn up to any turn before the final one, and the model is trained to recognize 66 that no error is present. This approach ensures that the model is not biased toward expecting a complete conversation as input and can function at any turn within an ongoing dialogue. We fine-tune the LLama model for 5 epochs using a learning rate of 1e-5 and a batch size of 2. Full fine-tuning is performed, meaning that all model weights are updated, using bfloat16 precision on a single p4.2xlarge instance equipped with 8 Nvidia A100 GPUs. The model weights are updated based solely on the reasoning text, not the dialogue itself ‡ . Additional training details and the prompts used for fine-tuning are provided in Appendix D.2. Test Evaluation Setup For the test dataset, evaluation is performed at each turn in the conversation, unlike during training. For example, a single “clean” dialogue data point consisting of N turns is transformed into N separate data points: one for each turn from the first through the Nth, with each data point labeled as “no error.” Similarly, each dialogue data point from the error dataset, where the last turn contains the error, is split into N data points. The first N-1 data points are labeled as “no error,” while the last data point (the conversation up to the turn containing the error) is labeled with the corresponding error category. This approach simulates a more realistic, real-world evaluation scenario, where errors are typically sparse, and most dialogue turns are error-free. This setup also highlights the importance of monitoring various metrics, such as precision and recall, which we discuss next. We also conduct fine-tuning experiments using different sizes of the training data: 10%, 25%, 50%, 75%, and 100% of the full training set. These additional experiments help us assess how different metrics improve with more data and whether increasing the dataset size yields a better diagnostic model. ‡See the "DataCollatorForCompletionOnlyLM" from the trl package https://huggingface.co/docs/trl/en/ sft_trainer 67 Figure 3.7: Results of ToolCritic on the test data split of the synthetic dataset. Precision, Recall and ROUGE score (a) and confusion matrix on the test dataset when using all training data, i.e. 225 examples per error category (b) We monitor three key metrics. The task of detecting error types is treated as a multi-class classification problem, for which we track precision and recall. To evaluate the quality of the model’s reasoning, we calculate the ROUGE score by comparing the model’s predicted reasoning against the reference reasoning. Figure 3.7 presents these results. Results The results show a clear improvement in all metrics with the addition of more training data. The model achieves a high recall of approximately 90%, though precision lags behind at around 80%. 
This discrepancy is expected given that the test dataset contains significantly more negative data (error-free dialogues) than positive data (dialogues with errors). In real-world applications, precision may be more critical to minimize false positives, and the appropriate threshold will depend on the specific use case. Regarding the impact of additional training data, we observe that recall and ROUGE scores are nearing saturation, while precision may still benefit from more data. Error Type Analysis To evaluate performance across different error types, The right-hand side of Figure 3.7 displays the confusion matrix for the test dataset using the model trained on the full training set. The results 68 indicate similar performance across most error categories, with some minor variations. We also observe the distribution of false positives, with the "non-invocation hesitation" category being the most common. Further analysis is required to understand how to mitigate this effect. Additionally, in Appendix D.3, we present the confusion matrix results in the low-label regime, when the model is trained on only 10% of the training data. This helps better differentiate the difficulty of various error categories. Experiments on Real LLM Mistakes In the previous section, we trained ToolCritic, a diagnostic model that demonstrated strong performance in detecting LLM tool-calling mistakes and providing detailed reasoning. However, the evaluation was conducted solely on synthetic data. In this section, we demonstrate the practical value of this model by applying it to the detection and correction of real LLM mistakes. To achieve this, we evaluate different LLMs on the test split of the SGD dataset, where the LLMs acts as the assistant and the user utterances serve as the ground truth. The LLMs’ predicted responses are then fed to the diagnostic model at each turn, which examines them for potential mistakes. If a mistake is detected, the model’s reasoning is fed back to the LLM for a single round of response revision. We also benchmark a self-correction approach without the diagnostic model, where the LLM is prompted to review and correct its own responses. Evaluation Setup Evaluating LLMs in multi-turn dialogues is inherently challenging, particularly when tool usage is involved. Traditional evaluation setups, which often focus on dialogue state tracking and metrics like inform rate, are becoming outdated due to the improved capabilities of modern LLMs. Ideally, the evaluation would involve the LLM interacting with a real human user and then assessing the LLM’s responses. However, this approach is impractical due to the high cost and complexity of involving human evaluators. 
Assistant LLM      Evaluation Scenario            Success Rate (%) ↑   Incorrect Action Rate (%) ↓   Precision (%) ↑   Recall (%) ↑
Llama 3.1 70B      Baseline (zero-shot)           5.46±5.46            84.47±7.83                    21.51±4.11        40.68±7.34
                   Self-Correction                3.90±3.90            84.93±7.36                    24.06±2.32        45.24±3.57
                   ToolCritic (partial feedback)  3.90±3.90            84.13±8.17                    24.91±3.86        40.68±7.34
                   ToolCritic (full feedback)     10.15±10.15          77.32±11.56                   33.03±4.46        43.83±10.49
Mistral Large 2    Baseline (zero-shot)           7.35±0.17            91.85±0.11                    18.54±0.18        34.26±0.45
                   Self-Correction                6.26±0.18            92.02±0.03                    17.05±0.19        35.58±0.17
                   ToolCritic (partial feedback)  8.16±0.97            91.64±0.15                    19.62±0.18        36.09±0.22
                   ToolCritic (full feedback)     10.61±0.94           84.74±0.61                    26.90±0.45        50.24±0.55
Claude 3 Sonnet    Baseline (zero-shot)           14.15±0.30           55.68±0.02                    44.28±0.24        64.60±0.21
                   Self-Correction                16.62±0.58           54.04±0.03                    46.31±0.02        66.62±0.59
                   ToolCritic (partial feedback)  19.95±0.35           52.63±0.13                    50.21±0.10        67.66±0.06
                   ToolCritic (full feedback)     27.88±0.85           46.27±0.03                    57.60±0.08        76.78±0.23

Table 3.6: Main results of the experimental evaluation on the SGD test data. Each experiment setup is repeated twice with random seeds. The mean and standard deviation are reported for each metric.

A more feasible alternative is to use static user messages from a dataset and have the LLM generate the assistant's responses. However, this method can result in incoherent conversations, as the assistant's responses may diverge from the static user messages. To address these challenges, [132] proposed a setup where an LLM acts as both the user and the assistant, with another LLM serving as the judge to evaluate the responses. While this method provides an automatic, dynamic evaluation, it relies on an LLM being reliable as a judge, which contradicts our premise that LLMs struggle to detect their own mistakes. Given these challenges, we adopt the evaluation setup proposed in [41], specifically tailored for assessing tool-augmented LLMs in dialogues. In this setup, the LLM is provided with the ground truth conversation history at each turn, rather than its own previous responses, and is tasked with predicting the next response. This approach mitigates the issue of incoherence, making it suitable for our focus on LLM function-calling ability. The downside of this approach is that it does not evaluate whether the LLM can hold a coherent multi-turn conversation. This is not an issue for this work because we are interested in the LLM's function-calling capabilities and not its dialogue capabilities.

Models and Benchmarks

LLMs as assistant. In our experiments, we benchmark three different LLMs playing the role of the assistant: Claude 3 Sonnet [3], a closed-source model, and two open-source models, Mistral Large 2 [1] and Llama 3.1 70B instruct [38]. All three models possess tool-calling capabilities.

Experiment Setup. For each assistant LLM, we evaluate three different setups: baseline (zero-shot) performance, self-correction (where the LLM corrects its own mistakes) following the Reflexion method proposed in [109], and using ToolCritic as an external feedback source. We also include an ablation study where ToolCritic provides only the predicted error description without the detailed reasoning (e.g., informing the LLM of a mistake in one of the arguments without specifying which one). This serves to evaluate the importance of ToolCritic's reasoning feedback. We set the temperature of all LLMs, as well as the ToolCritic model, to 0.1.

Metrics. We follow the metrics defined in [41], focusing on precision, recall, and incorrect action rate.
An action tool is one that makes changes in the real world, such as setting an alarm or booking a room, as opposed to passive tools like information retrieval. A tool-call is deemed correct if all arguments match the ground truth call, with fuzzy matching applied for free-form string arguments. Additionally, we monitor a dialogue-level metric called success rate, defined as the assistant correctly predicting all required tool-calls in the dialogue without making any unnecessary action tool calls (extra search tool calls are permitted). We run each experiment twice with random seeds and report the mean and standard deviation of each metric.

Results

Baseline. The experimental results are summarized in Table 3.6. Baseline performance varies across the LLMs, with Claude 3 Sonnet achieving the highest scores of 14.15%, 44.28% and 64.60% on success, precision, and recall, respectively. This highlights the difficulty of the task and how even current advanced LLMs still struggle to use tools in multi-turn dialogues. Llama 3.1 achieves the lowest zero-shot performance, with a dialogue success rate of only 5.46%.

Self-Correction. The poor results of self-correction reinforce our initial hypothesis that LLMs cannot effectively detect their own mistakes. In some cases, self-correction leads only to a minor improvement, while in other cases it can even degrade performance.

ToolCritic. When using ToolCritic's thought feedback, we observe significant improvements across all three models, with a maximum success rate increase of 13% for Claude 3 Sonnet. The gains for the Mistral and Llama models are smaller, likely due to their already poor performance in tool usage, suggesting that these models still struggle with tool usage in a dialogue context. The ablation study, where only the error description from ToolCritic was provided as feedback, also shows improvements over the baseline, though the gains are smaller than those achieved with the full reasoning feedback. This experiment underscores the value of the diagnostic model's detailed feedback, showing that while simple error descriptions can be helpful for certain mistakes, more nuanced feedback is needed for addressing more complex mistakes. For example, if the assistant calls a tool prematurely, it may be enough to provide the error description of the "premature invocation" error category without producing a specific reasoning. On the other hand, if the assistant makes a mistake in an optional argument of a tool call, simply providing the description of the "optional arguments" error may not be enough to correct the mistake. The full reasoning is more useful in this case, as it can point out the specific arguments in question.

Error Analysis

While the previous section showcased the performance metrics of the various models, in this section we take a closer look at the types of mistakes the LLMs made during the evaluation. Figure 3.8 displays the "error profile" of Claude 3 Sonnet, illustrating the distribution of errors detected by ToolCritic when evaluating the model on the SGD test set. We find that 77% of the time, the diagnostic model detects no mistakes. The most common detected mistake is "premature invocation", where the LLM calls a tool before gathering all necessary information from the user. This aligns with findings from a previous study [41]. The next most common mistakes relate to tool arguments, indicating the difficulty of accurately mapping user input to the correct arguments. The remaining error categories occur less frequently.
This “error profile” can be valuable for comparing the weaknesses of different models and guiding future improvements. Generalization The results on the SGD dataset demonstrate a clear improvement in tool-calling accuracy when feedback from a diagnostic model such as ToolCritic is utilized. This finding supports the effectiveness of using external feedback to enhance LLM performance in multi-turn dialogues. However, evaluating the model on different datasets is equally important to assess its generalizability. Two datasets particularly relevant to this task are ToolTalk [41] and API-Bank [74]. Unfortunately, the relatively small sizes of these datasets make them unsuitable for training the diagnostic model. Instead of conducting a full-scale evaluation, we present several illustrative examples in Appendix D.4, where the ToolCritic model- fine-tuned on the SGD data—is applied to instances from the ToolTalk dataset. These examples suggest that some level of generalization to new tools is possible. However, the diagnostic model intuitively will always perform better on the same set of tools it was trained on. 3.4.4 Limitations Inference Cost. One drawback of utilizing external feedback from a diagnostic model is the increased inference cost. At each turn, an inference call to the diagnostic model is required, and if a mistake is detected, an additional inference call to the assistant LLM is necessary to revise its response. While the diagnostic model can be made relatively small and cost-effective, the extra inference call for the main assistant LLM can still be expensive, particularly in real-time applications. This emphasizes the need for an advanced tool-augmented LLM that commits less mistakes and therefore requires less response revisions. 73 Figure 3.8: “Error Profile” of Claude 3 Sonnet when evaluated on the test data split of the SGD dataset. The diagram presents the distribution of mistakes detected by ToolCritic when Claude 3 Sonnet acts as the assistant LLM. Data Labeling Cost. Another limitation of this approach is the requirement for labeled training data to fine-tune the diagnostic model. Although we demonstrated that decent performance improvement can be achieved with fine-tuning on synthetically modified data, more human-labeled data may be desired to ensure a more accurate and capable diagnostic model. Generalization. The current setup assumes that the defined error categories are comprehensive. If an assistant LLM makes an mistake that is not covered by these predefined categories, the diagnostic model may fail to detect it. This limitation highlights the challenge of ensuring that the error taxonomy is exhaustive and applicable to all possible failure modes in tool usage. Chained Tool Calls. Our experiments focused on single tool calls per turn, as this is the case in the SGD dataset. However, more complex real-world scenarios may involve multiple tool calls within a single turn to fulfill a user’s request. Extending the current setup to handle such 74 scenarios would require the introduction of additional error categories to capture the new failure modes associated with chained tool calls. 75 Chapter 4 Concluding Remarks and Suggested Future Work 4.1 Concluding Remarks In this work, we have explored the framework of training deep learning models within two distinct regimes: compute-bound and data-bound. A detailed theoretical study was performed on the fundamental processing of neural networks, and what alternatives are present. 
This was followed by outlining and implementing novel techniques to reduce computational overhead in training expansive models, with the use of new implementations of number system processing. Additionally, we studied the process of creating datasets from the ground up, with all the challenges and intricacies involved, and provided methods to increase the effectiveness of this process with the end-task in mind. Finally, we explored the potential of synthetic data to augment the training process in scenarios constrained by limited labeled data. Particularly, for LLM tool-use, we developed ToolCritic, a diagnostic model trained on synthetically-modified data to verify and critique tool-usage in LLMs. For the Named Entity Recognition (NER) task, we have demonstrated the effective application of synthetic data, showcasing our “Mask the Unknown Tags” approach to mitigate the partial annotation problem. Continuing in this direction, one of the future directions for this work is to explore and apply synthetic data for the task of RE. The next section presents our initial attempt at this problem and the current challenges faced. 76 4.2 Future Work: Synthetic Data for RE 4.2.1 Background In section 3.3, the problem of using synthetic data generated from LLMs to complement NER datasets and train NER models was explored. Our novel technique, dubbed “Mask the Unknown Tags”, proved to be effective in mitigating the noise in the synthetic data due to the partial annotation problem. While this approach proved effective for NER, the task of RE requires a different formulation. In Named Entity Recognition, the missing labels are explicitly present in the text input and so directly applying a mask over the output tokens can alleviate this problem. However, relations between entity pairs are implicit rather than explicit. One could think of ordering the output relations from a sentence input in such a way that allows for separating the false negatives from the true positive relations. Masking the loss over this kind of output would again work for this case. Unfortunately, there is no specific output ordering that allows us to achieve this formulation. In this following section, we describe our initial attempt to develop a strategy that extends this synthetic data framework to the problem of Relation Extraction. 4.2.2 Prompting using Entity Pair Combinations Similar to the problem faced by using synthetic data in the NER task, RE synthetic data suffers from the partial annotation noise. Consider the following prompt which is used to generate synthetic data from some LLM. Instructions: - You are given a JSON object that includes a sentence, named entities occurring in the sentence, and relations between entity pairs. - Propose new entities, and generate a sentence that expresses the same original relations 77 between the chosen entities. - Important: The entities and sentence should follow the same domain and context as the given sentence. - The sentence should be grammatically correct and contextually coherent. - Format your response in JSON format following the input example. - Only respond with the generated JSON object. Example Input: { "entity_1": "John", "type_1": "People", "entity_2": "Google", "type_2": "Organization", "relation_1": "(John, Work_for, Google)", "generated_sentence": "John has been with Google for the past five years." } A significant challenge arises when the LLM outputs sentences that include relations which are not labeled. This situation leads to partially annotated data. 
For instance, consider below an example of an LLM output following the prompt used above: LLM Output: { "entity_1": "Mary", "type_1": "People", "entity_2": "IBM", "type_2": "Organization", 78 "relation_1": "(Mary, Work_for, IBM)", "generated_sentence": "Mary works for IBM in New York." } In this example, while the entities “Mary” and “IBM” are correctly used and labeled while the entity “New York” remains unlabeled, and thus the relation “(Mary, Live_in, New York)” is not labeled. Assuming the relation type “Live_in” is part of the dataset definition, the synthetic example generated is only partially labeled, i.e. not all relation or named entities present in the sentence are labeled. This can introduce noise into the training data and training an RE model on this data could lead to sub-optimal performance, as the model might be penalized for correctly identifying unlabeled relations in text. To circumvent this problem, we suggest a reformulation of the training process. Instead of presenting the entire sentence with all entity mentions and expecting the model to infer all possible relations, we decompose the problem into multiple examples. Each example focuses on a subset of entities, and the model is trained to extract relations solely between those entities. By doing so, the model learns to confidently extract relations based on the given input, and thus cannot predict the unlabeled relations when their corresponding entities are not present in the input. Below we present an example of how this process creates multiple new artificial training examples from a single original example. Consider again the sentence “Mary works for IBM in New York”. Assuming this was a fully hand-labeled example, we can create four training examples as follows: Fully annotated example (Labeled Dataset): 79 • Combination #1 Prompt: Extract relations between the given entities occurring in the sentence. Relation Types: [’Work_for’, ’Live_In’, ’Kill’] Sentence: Mary works for IBM in New York. Entities: [’Mary’, ’IBM’] LLM Output: [(’Mary’, ’Work_for’, ’IBM’)] • Combination #2 Prompt: Extract relations between the given entities occurring in the sentence. Relation Types: [’Work_for’, ’Live_In’, ’Kill’] Sentence: Mary works for IBM in New York. Entities: [’Mary’, ’New York’] LLM Output: [(’Mary’, ’Live_in’, ’New York’)] • Combination #3 Prompt: Extract relations between the given entities occurring in the sentence. Relation Types: [’Work_for’, ’Live_In’, ’Kill’] Sentence: Mary works for IBM in New York. Entities: [’IBM’, ’New York’] LLM Output: [] • Combination #4 Prompt: Extract relations between the given entities occurring in the sentence. Relation Types: [’Work_for’, ’Live_In’, ’Kill’] Sentence: Mary works for IBM in New York. Entities: [’Mary’, ’IBM’, ’New York’] LLM Output: [(’Mary’, ’Work_for’, ’IBM’), (’Mary’, ’Live_in’, ’New York’)] 80 Now assuming that this sentence was part of our synthetic dataset, and only the entities ’Mary’ and ’IBM’ were labeled entities, then we would end up with a single training example as shown below Partially annotated example (Synthetic Dataset): • Only one combination Prompt: Extract relations between the given entities occurring in the sentence. Relation Types: [’Work_for’, ’Live_In’, ’Kill’] Sentence: Mary works for IBM in New York. Entities: [’Mary’, ’IBM’] LLM Output: [(’Mary’, ’Work_for’, ’IBM’)] Using this formulation, the model is not trained on noisy data. 
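A minimal sketch of this entity-subset decomposition is shown below; the prompt wording and field layout are illustrative, and relations are assumed to be stored as (head entity, relation label, tail entity) triples.

from itertools import combinations

def decompose_example(sentence, entities, relations, relation_types):
    """Expand one fully labeled sentence into one training example per entity subset
    of size two or more. Only relations whose head and tail are both in the chosen
    subset are kept as the target output for that example."""
    examples = []
    for k in range(2, len(entities) + 1):
        for subset in combinations(entities, k):
            target = [r for r in relations if r[0] in subset and r[2] in subset]
            prompt = (
                "Extract relations between the given entities occurring in the sentence.\n"
                f"Relation Types: {relation_types}\n"
                f"Sentence: {sentence}\n"
                f"Entities: {list(subset)}"
            )
            examples.append({"prompt": prompt, "output": target})
    return examples

# For a synthetic, partially labeled sentence, only the single subset consisting of the
# labeled entities is generated instead of all combinations.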
At the same time, the training process ensures that there is no mismatch between synthetic and hand-annotated examples. For both cases, the model is trained to extract a subset of relations corresponding to the subset of entities presented. Contrast this with the standard training process where the model would be expected to extract all possible relations from a sentence. 4.2.3 Experimental Results In this section, we present the experimental results of using our decomposition strategy for RE. Note that the experimental setup closely follows the same setup that was used for the NER experiments in Section 3.3. We repeat all the details here for clarity and self-containment. Datasets. To evaluate the performance of our proposed technique, two datasets were selected, a general-domain dataset and a scientific dataset. For the first set of experiments, we used the CONLL04 dataset [101], which contains 922 training sentences, with four types of entities and five types of relations, drawn from news reports. We expect synthetic data in this case to provide 81 large improvements over the baseline, as no specialized knowledge is required to generate data that resembles excerpts from news reports. On the other hand, the second dataset we evaluate on is the SciERC dataset [79], consisting of 1861 training sentences from scientific abstracts, with six entity types and seven relation types. In this case, we expect the task of generating synthetic data resembling the train distribution to be more challenging to the LLM, considering that domain-specific knowledge is required to understand and generate scientific entities. LLM for Synthetic Data. To generate the synthetic dataset, two recent LLMs were selected, GPT 3.5 [92] and Claude Sonnet ∗ . Both these models are closed-sourced, so detailed information regarding the architectures, model sizes, training data is not available. Each model provider has an API that is used to query these models. These models were selected due to their high performance on various NLP benchmarks surpassing their counterparts [50]. Note that the Claude Sonnet’s benchmark results are higher than that of GPT 3.5, and so, one of the experimental observations will be comparing the quality and impact of synthetic data generated from LLMs with different strength levels. Baseline Model for Training. Since we formulated the NER problem into a sequence labeling task, a sequence-to-sequence model, Flan-T5-Base [25], is selected as the baseline model to evaluate our approach on. Flan-T5 is a variant of the T5 (Text-to-Text Transfer Transformer) models developed by Google [97]. Flan-T5 is specifically fine-tuned with a technique called “instruction tuning” which trains the model on a diverse set of tasks described in natural language, which helps the model better follow human-like instructions. Flan-T5-Base is one model from this family, with a 250 million parameters. This model size is estimated to be two orders of magnitude lower than the GPT 3.5 and Claude Sonnet models we use for synthetic data generation (information regarding ∗ https://www.anthropic.com/news/claude-3-family 82 the exact sizes of these models is not available). This fact, alongside Flan-T5 being an open-source model, again highlights the motivations behind this framework. Experimental Setup. Extensive evaluations are conducted to showcase the value of our approach. First, for each labeled training dataset, two synthetic datasets were generated, one using GPT 3.5 and one using Claude Sonnet. 
These synthetic datasets are approximately the same size as the original training data. Second, our experiments were conducted with different ratios of the training data available, specifically 10%, 40%, 70% and 100%. This is to showcase the effect of synthetic data at varying levels of available data. All training runs used the same hyper-parameters: 10 epochs, a batch size of 4, and a learning rate of 5e-5. No hyper-parameter tuning was performed. Three random iterations are run for each training setup. We report the mean and standard deviation of the micro F1 score on the fixed test dataset.

Synthetic Source Model   Training Dataset   10%           40%           70%           100%
N/A                      Annotations        66.13±1.02%   75.49±0.84%   77.88±0.49%   80.86±0.69%
GPT 3.5 Turbo            + Synthetic        65.65±1.89%   76.24±0.12%   78.35±1.31%   80.73±0.30%
Claude Sonnet            + Synthetic        65.60±1.39%   74.61±0.28%   78.40±0.54%   79.75±1.00%

Table 4.1: Evaluation results of RE on the CONLL04 dataset. Reported values are the mean and standard deviation of the micro F1 score.

Synthetic Source Model   Training Dataset   10%           40%           70%
N/A                      Annotations        44.53±1.54%   60.40±0.35%   64.39±0.91%
GPT 3.5 Turbo            + Synthetic        44.91±0.40%   59.04±1.15%   62.53±0.87%
Claude Sonnet            + Synthetic        44.29±0.29%   61.20±0.84%   62.56±0.92%

Table 4.2: Evaluation results of RE on the SciERC dataset. Reported values are the mean and standard deviation of the micro F1 score.

Results. Table 4.1 and Table 4.2 present the results of our experiments on the CONLL04 and SciERC datasets, respectively. The first row of each table is the baseline, i.e., training Flan-T5 with the labeled training data only. The rows marked "+ Synthetic" are the results of training Flan-T5 with the labeled dataset combined with the synthetic data, utilizing the proposed entity decomposition approach. The results clearly show that no gain is observed from adding the synthetic data. In some cases, we even observe a decrease in micro F1 score. These results indicate that the current formulation of the problem is not effective.

One possible reason for these negative results has to do with the new size of the training data. After applying the decomposition approach, the size of the original training dataset is artificially bloated. If a sentence contains n named entities, then the number of combinations of two or more entities drawn from those n entities is

\sum_{k=2}^{n} \binom{n}{k} = 2^n - n - 1

Since synthetic data contains fewer named entities than the hand-labeled examples, it follows that the new artificial examples derived from the hand-labeled dataset will dominate the training set, and the effect of the synthetic data will be minimized. More analysis is needed to verify this conclusion. Future work on this problem will explore new and more efficient formulations that can effectively utilize synthetic data and achieve significant gains similar to those seen in the NER experiments (see Section 3.3). One solution might be to generate multiple relations per synthetic sentence. Having multiple labeled entities and relations in a sentence reduces the chance of any mislabels, for the simple reason that sentence length is limited and cannot accommodate a large number of labels. The challenge here will mainly be ensuring the generated sentences are actually coherent and express the desired relations. So far, we have shown that the most advanced LLMs are capable of expressing a single relation between a pair of entities in a sentence. Nonetheless, this task is much easier than having to express multiple relations at the same time.
Proper post-processing is needed to ensure the desired output is reached and to filter out any incoherent or simply incorrect labels. Another direction might be data augmentation, i.e. given an input labeled sentence, apply some changes to create a new sentence while preserving the labels. This differs from our previous approach. So far we have looked at generating new examples, inspired by a the training dataset 84 but containing new entities, context and relations. While this ensures more diverse set of training data, it suffers from many challenges, especially the partial annotation problem. On the other hand, data augmentation relies on doing some changes to the training examples, without altering the labels. The advantage here is that, since no new labels are created, the partial annotation problem is less likely to occur. The downside is that the augmented dataset might not be diverse enough to add new information to the training data. A proper measure of diversity could be defined and the augmentation process could be setup to optimize this measure. 85 Bibliography [1] Mistral AI. Mistral Large 2. https://mistral.ai/news/mistral-large-2407/. Accessed: 2023-08-20. 2024. [2] Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. “TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Ed. by Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault. Association for Computational Linguistics, 2020, pp. 1558–1569. doi: 10.18653/V1/2020.ACL-MAIN.142. [3] Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family. Accessed: 2023-08-20. 2024. [4] Mark Arnold, Ed Chester, and Corey Johnson. “Training Neural Nets using only an Approximate Tableless LNS ALU”. In: 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP). 2020, pp. 69–72. doi: 10.1109/ASAP49362.2020.00020. [5] Mark G. Arnold, Ed Chester, and Corey Johnson. “Training Neural Nets using only an Approximate Tableless LNS ALU”. In: 31st IEEE International Conference on Application-specific Systems, Architectures and Processors , ASAP 2020, Manchester, United Kingdom, July 6-8, 2020. IEEE, 2020, pp. 69–72. doi: 10.1109/ASAP49362.2020.00020. [6] MG Arnold, TA Bailey, JJ Cupal, and MD Winkel. “On the cost effectiveness of logarithmic arithmetic for backpropagation training on SIMD processors”. In: Neural Networks, 1997., International Conference on. Vol. 2. IEEE. 1997, pp. 933–936. [7] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. “Scalable methods for 8-bit training of neural networks”. In: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada. Ed. by Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett. 2018, pp. 5151–5159. url: https: //proceedings.neurips.cc/paper/2018/hash/e82c4b19b8151ddc25d4d93baf7b908f-Abstract.html. 86 [8] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. “Scalable methods for 8-bit training of neural networks”. In: Advances in Neural Information Processing Systems. Ed. by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc., 2018. [9] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 
“Curriculum learning”. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009. Ed. by Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman. Vol. 382. ACM International Conference Proceeding Series. ACM, 2009, pp. 41–48. doi: 10.1145/1553374.1553380. [10] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. “DBpedia - A crystallization point for the Web of Data”. In: Journal of Web Semantics 7.3 (2009). The Web of Data, pp. 154–165. issn: 1570-8268. doi: https://doi.org/10.1016/j.websem.2009.07.002. [11] Bloomberg - Financial news, analysis, and data. https://www.bloomberg.com/. 2023. [12] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. “End to End Learning for Self-Driving Cars”. In: CoRR abs/1604.07316 (2016). arXiv: 1604.07316. [13] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. SIGMOD ’08. Vancouver, Canada: Association for Computing Machinery, 2008, pp. 1247–1250. isbn: 9781605581026. doi: 10.1145/1376616.1376746. [14] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. “Language Models are Few-Shot Learners”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. url: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64aPaper.pdf. [15] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Ultes Stefan, Ramadan Osman, and Milica Gašić. “MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2018. 87 [16] Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. “Taskmaster-1:Toward a Realistic and Diverse Dialog Dataset”. In: 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing. Hong Kong, 2019. [17] Pere-Lluís Huguet Cabot and Roberto Navigli. “REBEL: Relation Extraction By End-to-end Language generation”. In: Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021. Ed. by Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih. Association for Computational Linguistics, 2021, pp. 2370–2381. doi: 10.18653/V1/2021.FINDINGS-EMNLP.204. [18] Yongqiang Cao, Yang Chen, and Deepak Khosla. 
“Spiking Deep Convolutional Neural Networks for Energy-Efficient Object Recognition”. In: Int. J. Comput. Vis. 113.1 (2015), pp. 54–66. doi: 10.1007/S11263-014-0788-3. [19] Vasileios Charisopoulos and Petros Maragos. “Morphological perceptrons: geometry and training algorithms”. In: International Symposium on Mathematical Morphology and Its Applications to Signal and Image Processing. Springer. 2017, pp. 3–15. [20] Yu-Hsin Chen, Tushar Krishna, Joel Emer, and Vivienne Sze. “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks”. In: IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers. 2016, pp. 262–263. [21] Jing Chen, Baotian Hu, Weihua Peng, Qingcai Chen, and Buzhou Tang. “Biomedical relation extraction via knowledge-enhanced reading comprehension”. In: BMC Bioinformatics 23.1 (2022), p. 20. doi: 10.1186/s12859-021-04534-5. [22] Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, and Feng Zhao. “T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 9510–9529. url: https://aclanthology.org/2024.acl-long.515. [23] Jungwook Choi, Swagath Venkataramani, Vijayalakshmi Srinivasan, K. Gopalakrishnan, Zhuo Wang, and Pierce I-Jen Chuang. “Accurate and Efficient 2-bit Quantized Neural Networks”. In: Conference on Machine Learning and Systems. 2019. url: https://api.semanticscholar.org/CorpusID:96438794. 88 [24] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. “PaLM: Scaling Language Modeling with Pathways”. In: J. Mach. Learn. Res. 24 (2023), 240:1–240:113. url: http://jmlr.org/papers/v24/22-1144.html. [25] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. “Scaling Instruction-Finetuned Language Models”. In: CoRR abs/2210.11416 (2022). doi: 10.48550/ARXIV.2210.11416. arXiv: 2210.11416. [26] CNBC. https://www.cnbc.com. 2023. 
[27] Creative Commons. Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0/. Accessed: 2023. 2023. [28] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. In: Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Ed. by Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh. 2022. url: http://papers.nips.cc/paper%5C_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5- Abstract-Conference.html. [29] Louise Deléger, Qi Li, Todd Lingren, Megan Kaiser, Katalin Molnár, Laura Stoutenborough, Michal Kouril, Keith A. Marsolo, and Imre Solti. “Building Gold Standard Corpora for Medical Natural Language Processing Tasks”. In: AMIA ... Annual Symposium proceedings. AMIA Symposium 2012 (2012), pp. 144–53. url: https://api.semanticscholar.org/CorpusID:12287694. [30] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A large-scale hierarchical image database”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848. 89 [31] Kartik Detroja, C. K. Bhensdadia, and Brijesh S. Bhatt. “A survey on Relation Extraction”. In: Intell. Syst. Appl. 19 (2023), p. 200244. doi: 10.1016/J.ISWA.2023.200244. [32] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. “GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale”. In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., 2022, pp. 30318–30332. [33] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. “QLoRA: Efficient Finetuning of Quantized LLMs”. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine. 2023. url: http://papers.nips.cc/paper%5C_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049bAbstract-Conference.html. [34] Tobias Deußer, Syed Musharraf Ali, Lars Patrick Hillebrand, Desiana Nurchalifah, Basil Jacob, Christian Bauckhage, and Rafet Sifa. “KPI-EDGAR: A Novel Dataset and Accompanying Metric for Relation Extraction from Financial Documents”. In: 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA) (2022), pp. 1654–1659. [35] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. [36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Ed. by Jill Burstein, Christy Doran, and Thamar Solorio. 
Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. [37] Sourya Dey, Kuan-Wen Huang, Peter A. Beerel, and Keith M. Chugg. “Pre-Defined Sparse Neural Networks With Hardware Acceleration”. In: IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.2 (2019), pp. 332–345. doi: 10.1109/JETCAS.2019.2910864. [38] Abhimanyu Dubey et al. “The Llama 3 Herd of Models”. In: CoRR abs/2407.21783 (2024). doi: 10.48550/ARXIV.2407.21783. arXiv: 2407.21783. 90 [39] Markus Eberts and Adrian Ulges. “Span-Based Joint Entity and Relation Extraction with Transformer Pre-Training”. In: ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020). Ed. by Giuseppe De Giacomo, Alejandro Catalá, Bistra Dilkina, Michela Milano, Senén Barro, Alberto Bugarín, and Jérôme Lang. Vol. 325. Frontiers in Artificial Intelligence and Applications. IOS Press, 2020, pp. 2006–2013. doi: 10.3233/FAIA200321. [40] Jesper E. van Engelen and Holger H. Hoos. “A survey on semi-supervised learning”. In: Machine Learning 109.2 (2020), pp. 373–440. doi: 10.1007/s10994-019-05855-6. [41] Nicholas Farn and Richard Shin. “ToolTalk: Evaluating Tool-Usage in a Conversational Setting”. In: CoRR abs/2311.10775 (2023). doi: 10.48550/ARXIV.2311.10775. arXiv: 2311.10775. [42] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. “The WebNLG Challenge: Generating Text from RDF Data”. In: INLG:2017:35. Ed. by Jose M. Alonso, Alberto Bugarín, and Ehud Reiter. Santiago de Compostela, Spain: acl, Sept. 2017, pp. 124–133. doi: 10.18653/v1/W17-3518. [43] Alireza Ghaffari, Marzieh S. Tahaei, Mohammadreza Tayaranian, Masoud Asgharian, and Vahid Partovi Nia. “Is Integer Arithmetic Enough for Deep Learning Training?” In: Advances in Neural Information Processing Systems. Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., 2022, pp. 27402–27413. [44] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. “Knowledge Distillation: A Survey”. In: Int. J. Comput. Vis. 129.6 (2021), pp. 1789–1819. doi: 10.1007/S11263-021-01453-Z. [45] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. “CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing”. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. url: https://openreview.net/forum?id=Sx038qxjek. [46] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. “Deep learning with limited numerical precision”. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. ICML’15. Lille, France: JMLR.org, 2015, pp. 1737–1746. [47] Ridong Han, Tao Peng, Chaohao Yang, Benyou Wang, Lu Liu, and Xiang Wan. “Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors”. In: ArXiv abs/2305.14450 (2023). url: https://api.semanticscholar.org/CorpusID:258865200. 91 [48] Song Han, Huizi Mao, and William J. Dally. “Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding”. 
In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2016. [49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90. [50] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. “Measuring Massive Multitask Language Understanding”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. url: https://openreview.net/forum?id=d7KBjmI3GmQ. [51] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. “Distilling the Knowledge in a Neural Network”. In: CoRR abs/1503.02531 (2015). arXiv: 1503.02531. url: http://arxiv.org/abs/1503.02531. [52] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. “Training Compute-Optimal Large Language Models”. In: CoRR abs/2203.15556 (2022). doi: 10.48550/ARXIV.2203.15556. arXiv: 2203.15556. [53] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. “Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations”. In: Annual Meeting of the Association for Computational Linguistics. 2011. [54] Kurt Hornik. “Approximation capabilities of multilayer feedforward networks”. In: Neural networks 4.2 (1991), pp. 251–257. [55] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks”. In: Neural networks 3.5 (1990), pp. 551–560. [56] Mark Horowitz. “1.1 Computing’s energy problem (and what we can do about it)”. In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). 2014, pp. 10–14. doi: 10.1109/ISSCC.2014.6757323. [57] George Hripcsak and Adam S. Rothschild. “Technical Brief: Agreement, the F-Measure, and Reliability in Information Retrieval”. In: Journal of the American Medical Informatics Association : JAMIA 12 3 (2005), pp. 296–8. url: https://api.semanticscholar.org/CorpusID:17077165. 92 [58] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. “Large Language Models Cannot Self-Correct Reasoning Yet”. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. url: https://openreview.net/forum?id=IkmD3fKBPQ. [59] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. 
Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. “In-Datacenter Performance Analysis of a Tensor Processing Unit”. In: Proceedings of the 44th Annual International Symposium on Computer Architecture. ISCA ’17. Toronto, ON, Canada: Association for Computing Machinery, 2017, pp. 1–12. isbn: 9781450348928. doi: 10.1145/3079856.3080246. [60] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. “Highly accurate protein structure prediction with AlphaFold”. In: Nature 596.7873 (2021), pp. 583–589. doi: 10.1038/s41586-021-03819-2. [61] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. “Scaling Laws for Neural Language Models”. In: CoRR abs/2001.08361 (2020). arXiv: 2001.08361. url: https://arxiv.org/abs/2001.08361. [62] Nick G Kingsbury and Peter JW Rayner. “Digital filtering using logarithmic arithmetic”. In: Electronics Letters 2.7 (1971), pp. 56–58. [63] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. “Optimization by Simulated Annealing”. In: Science 220.4598 (1983), pp. 671–680. doi: 10.1126/science.220.4598.671. eprint: https://www.science.org/doi/pdf/10.1126/science.220.4598.671. 93 [64] Tom Kocmi and Ondřej Bojar. “Curriculum Learning and Minibatch Bucketing in Neural Machine Translation”. In: RANLP:2017:1. Ed. by Ruslan Mitkov and Galia Angelova. Varna, Bulgaria: INCOMA Ltd., Sept. 2017, pp. 379–386. doi: 10.26615/978-954-452-049-6_050. [65] Raghuraman Krishnamoorthi. “Quantizing deep convolutional networks for efficient inference: A whitepaper”. In: CoRR abs/1806.08342 (2018). arXiv: 1806.08342. [66] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Tech. rep. University of Toronto, 2009. [67] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Vol. 25. Curran Associates, Inc., 2012. [68] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. “Self-Paced Learning for Latent Variable Models”. In: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. 
Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada. Ed. by John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta. Curran Associates, Inc., 2010, pp. 1189–1197. url: https: //proceedings.neurips.cc/paper/2010/hash/e57c6b956a6521b28495f2886ca0977a-Abstract.html. [69] Souvik Kundu, Mahdi Nazemi, Massoud Pedram, Keith M. Chugg, and Peter A. Beerel. “Pre-Defined Sparsity for Low-Complexity Convolutional Neural Networks”. In: IEEE Transactions on Computers 69.7 (2020), pp. 1045–1058. doi: 10.1109/TC.2020.2972520. [70] Y. Le and X. Yang. Tiny ImageNet Visual Recognition Challenge. http://cs231n.stanford.edu/reports/2015/pdfs/yle_project.pdf. 2017. [71] Samuel C. Lee and Albert D. Edgar. “The focus number system”. In: IEEE Transactions on Computers 26.11 (1977), pp. 1167–1170. [72] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdel-rahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Annual Meeting of the Association for Computational Linguistics. 2019. [73] Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. “BioCreative V CDR task corpus: a resource for chemical disease relation extraction”. In: Database 2016 (May 2016). baw068. issn: 1758-0463. doi: 10.1093/database/baw068. eprint: https: //academic.oup.com/database/article-pdf/doi/10.1093/database/baw068/8224483/baw068.pdf. 94 [74] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. Ed. by Houda Bouamor, Juan Pino, and Kalika Bali. Association for Computational Linguistics, 2023, pp. 3102–3116. doi: 10.18653/V1/2023.EMNLP-MAIN.187. [75] Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. “Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations”. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Ed. by Houda Bouamor, Juan Pino, and Kalika Bali. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 10443–10461. doi: 10.18653/v1/2023.emnlp-main.647. [76] Chang Liu, Rui Zhang, Xishan Zhang, Yifan Hao, Zidong Du, Xing Hu, Ling Li, and Qi Guo. Ultra-low Precision Multiplication-free Training for Deep Neural Networks. 2023. arXiv: 2302.14458 [cs.LG]. [77] Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai. Best Practices and Lessons Learned on Synthetic Data for Language Models. 2024. arXiv: 2404.07503 [cs.CL]. [78] Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent with Warm Restarts”. In: International Conference on Learning Representations, ICLR. 2017. [79] Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. “Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Ed. by Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii. 
Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 3219–3232. doi: 10.18653/v1/D18-1360. [80] Shengfei Lyu and Huanhuan Chen. “Relation Classification with Entity Type Restriction”. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, Aug. 2021, pp. 390–395. doi: 10.18653/v1/2021.findings-acl.34. [81] Fan Ma, Deyu Meng, Qi Xie, Zina Li, and Xuanyi Dong. “Self-Paced Co-training”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. PMLR, 2017, pp. 2275–2284. url: https://proceedings.mlr.press/v70/ma17b.html. 95 [82] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. “Self-Refine: Iterative Refinement with Self-Feedback”. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine. 2023. url: http://papers.nips.cc/paper%5C_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3- Abstract-Conference.html. [83] Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. “LLM Critics Help Catch LLM Bugs”. In: CoRR abs/2407.00215 (2024). doi: 10.48550/ARXIV.2407.00215. arXiv: 2407.00215. [84] Bill McDonald. Cleaned and Raw 10-X Files, Software Repository for Accounting and Finance. University of Notre Dame, Mendoza College of Business. 2023. url: https://sraf.nd.edu/sec-edgar-data/cleaned-10x-files/. [85] Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. “Distant supervision for relation extraction without labeled data”. In: ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore. Ed. by Keh-Yih Su, Jian Su, and Janyce Wiebe. The Association for Computer Linguistics, 2009, pp. 1003–1011. url: https://aclanthology.org/P09-1113/. [86] John N. Mitchell. “Computer Multiplication and Division Using Binary Logarithms”. In: IRE Transactions on Electronic Computers EC-11.4 (1962), pp. 512–517. doi: 10.1109/TEC.1962.5219391. [87] Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional Neural Networks using Logarithmic Data Representation. 2016. arXiv: 1603.01025 [cs.NE]. [88] Daisuke Miyashita, Edward H. Lee, and Boris Murmann. “Convolutional Neural Networks using Logarithmic Data Representation”. In: CoRR abs/1603.01025 (2016). arXiv: 1603.01025. [89] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E. Taylor, and Peter Stone. “Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey”. In: J. Mach. Learn. Res. 21.1 (2020). issn: 1532-4435. [90] OpenAI. Fine-tuning Examples - OpenAI API. https://platform.openai.com/docs/guides/fine-tuning/fine-tuning-examples. Accessed: 12-2023. 2023. [91] OpenAI. “GPT-4 Technical Report”. In: CoRR abs/2303.08774 (2023). doi: 10.48550/ARXIV.2303.08774. arXiv: 2303.08774. [92] OpenAI. “Introducing chatgpt”. In: OpenAI Blog (2023). url: https://openai.com/blog/chatgpt. 96 [93] P. 
Robertson, E. Villebrum, and P. Hoeher. “A comparison of optimal and suboptimal MAP decoding algorithms operating in the log domain”. In: Proceeding, IEEE International Conference on Communications. Seattle, WA, 1995, pp. 1009–1013. [94] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. “Gorilla: Large Language Model Connected with Massive APIs”. In: CoRR abs/2305.15334 (2023). doi: 10.48550/ARXIV.2305.15334. arXiv: 2305.15334. [95] Nadeesha Perera, Matthias Dehmer, and Frank Emmert-Streib. “Named Entity Recognition and Relation Detection for Biomedical Information Extraction”. In: Frontiers in Cell and Developmental Biology 8 (2020). issn: 2296-634X. doi: 10.3389/fcell.2020.00673. [96] Behrang QasemiZadeh and Anne-Kathrin Schumann. “The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods”. In: LREC:2016:1. Ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Portorož, Slovenia: European Language Resources Association (ELRA), May 2016, pp. 1862–1868. [97] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: J. Mach. Learn. Res. 21 (2020), 140:1–140:67. url: http://jmlr.org/papers/v21/20-074.html. [98] Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. “Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset”. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020, pp. 8689–8696. doi: 10.1609/AAAI.V34I05.6394. [99] Sebastian Riedel, Limin Yao, and Andrew McCallum. “Modeling Relations and Their Mentions without Labeled Text”. In: Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010, Proceedings, Part III. Ed. by José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag. Vol. 6323. Lecture Notes in Computer Science. Springer, 2010, pp. 148–163. doi: 10.1007/978-3-642-15939-8\_10. [100] Gerhard X Ritter and Peter Sussner. “An introduction to morphological neural networks”. In: Proceedings of 13th International Conference on Pattern Recognition. Vol. 4. IEEE. 1996, pp. 709–717. [101] Dan Roth and Wen-tau Yih. “A Linear Programming Formulation for Global Inference in Natural Language Tasks”. In: Conference on Computational Natural Language Learning. 2004. 97 [102] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. “Towards spike-based machine intelligence with neuromorphic computing”. In: Nature 575.7784 (2019), pp. 607–617. doi: 10.1038/s41586-019-1677-2. [103] Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. “From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference”. In: 2023 IEEE High Performance Extreme Computing Conference (HPEC). 2023, pp. 1–9. doi: 10.1109/HPEC58863.2023.10363447. [104] Arnab Sanyal, Peter A. Beerel, and Keith M. Chugg. 
“Neural Network Training with Approximate Logarithmic Computations”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020, pp. 3122–3126. doi: 10.1109/ICASSP40776.2020.9053015. [105] Arnab Sanyal, Peter A. Beerel, and Keith M. Chugg. “Neural Network Training with Approximate Logarithmic Computations”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020, pp. 3122–3126. doi: 10.1109/ICASSP40776.2020.9053015. [106] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. “Toolformer: Language Models Can Teach Themselves to Use Tools”. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine. 2023. url: http://papers.nips.cc/paper%5C_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906- Abstract-Conference.html. [107] Agam Shah, Ruchit Vithani, Abhinav Gullapalli, and Sudheer Chava. “FiNER: Financial Named Entity Recognition Dataset and Weak-Supervision Model”. In: CoRR abs/2302.11157 (2023). doi: 10.48550/ARXIV.2302.11157. arXiv: 2302.11157. [108] Soumya Sharma, Tapas Nayak, Arusarka Bose, Ajay Kumar Meena, Koustuv Dasgupta, Niloy Ganguly, and Pawan Goyal. “FinRED: A Dataset for Relation Extraction in Financial Domain”. In: Companion Proceedings of the Web Conference 2022. WWW ’22. Virtual Event, Lyon, France: Association for Computing Machinery, 2022, pp. 595–597. isbn: 9781450391306. doi: 10.1145/3487553.3524637. [109] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. “Reflexion: language agents with verbal reinforcement learning”. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Ed. by Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine. 2023. url: http://papers.nips.cc/paper%5C_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90- Abstract-Conference.html. 98 [110] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. url: http://arxiv.org/abs/1409.1556. [111] Dorian Smiley. 10-kGPT: Analyze 10-Q and 10-K fillings with GPT. https://github.com/doriansmiley/10-kGPT. 2023. [112] Joe Stacey, Jianpeng Cheng, John Torr, Tristan Guigue, Joris Driesen, Alexandru Coca, Mark Gaynor, and Anders Johannsen. “LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues”. In: CoRR abs/2403.00462 (2024). doi: 10.48550/ARXIV.2403.00462. arXiv: 2403.00462. [113] Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi (Viji) Srinivasan, and Kailash Gopalakrishnan. “Ultra-Low Precision 4-bit Training of Deep Neural Networks”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 1796–1807. 
[114] Xiao Sun, Naigang Wang, Chia-Yu Chen, Jiamin Ni, Ankur Agrawal, Xiaodong Cui, Swagath Venkataramani, Kaoutar El Maghraoui, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. “Ultra-Low Precision 4-bit Training of Deep Neural Networks”. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin. 2020. url: https: //proceedings.neurips.cc/paper/2020/hash/13b919438259814cd5be8cb45877d577-Abstract.html. [115] Peter Sussner. “Morphological perceptron learning”. In: Proceedings of the 1998 IEEE International Symposium on Intelligent Control (ISIC) held jointly with IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA) Intell. IEEE. 1998, pp. 477–482. [116] Earl E Swartzlander and Aristides G Alexopoulos. “The sign/logarithm number system”. In: IEEE Transactions on Computers 24.12 (1975), pp. 1238–1242. [117] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9. [118] Qingyu Tan, Lu Xu, Lidong Bing, Hwee Tou Ng, and Sharifah Mahani Aljunied. “Revisiting DocRED - Addressing the False Negative Problem in Relation Extraction”. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022. Ed. by Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang. Association for Computational Linguistics, 2022, pp. 8472–8487. doi: 10.18653/V1/2022.EMNLP-MAIN.580. 99 [119] The Economic Times. https://economictimes.indiatimes.com. 2023. [120] The Financial Express. https://www.financialexpress.com. 2023. [121] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Agüera y Arcas, Claire Cui, Marian Croak, Ed H. Chi, and Quoc Le. “LaMDA: Language Models for Dialog Applications”. In: CoRR abs/2201.08239 (2022). arXiv: 2201.08239. url: https://arxiv.org/abs/2201.08239. [122] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. 
Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In: ArXiv abs/2307.09288 (2023). url: https://api.semanticscholar.org/CorpusID:259950998. [123] Gladys Tyen, Hassan Mansoor, Victor Carbune, Peter Chen, and Tony Mak. “LLMs cannot find reasoning errors, but can correct them given the error location”. In: Findings of the Association for Computational Linguistics ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics, Aug. 2024, pp. 13894–13908. url: https://aclanthology.org/2024.findings-acl.826. [124] U.S. Securities and Exchange Commission (SEC). https://www.sec.gov/. 2023. [125] Maolin Wang, Seyedramin Rasoulinezhad, Philip H. W. Leong, and Hayden K.-H. So. “NITI: Training Integer Neural Networks Using Integer-Only Arithmetic”. In: IEEE Transactions on Parallel and Distributed Systems 33.11 (2022), pp. 3249–3261. doi: 10.1109/TPDS.2022.3149787. 100 [126] Qingyang Wang, Michael A. Powell, Ali Geisa, Eric Bridgeford, and Joshua T. Vogelstein. “Why Do Networks Need Negative Weights?” In: CoRR abs/2208.03211 (2022). doi: 10.48550/ARXIV.2208.03211. arXiv: 2208.03211. [127] Yan Wen, Cong Fan, Geng Chen, Xin Chen, and Ming Chen. “A Survey on Named Entity Recognition”. In: Communications, Signal Processing, and Systems - Proceedings of the 8th International Conference on Communications, Signal Processing, and Systems, CSPS 2019, Urumqi, China, 20-22 July 2019. Ed. by Qilian Liang, Wei Wang, Xin Liu, Zhenyu Na, Min Jia, and Baoju Zhang. Vol. 571. Lecture Notes in Electrical Engineering. Springer, 2019, pp. 1803–1810. doi: 10.1007/978-981-13-9409-6\_218. [128] Haoyu Wu, Qing Lei, Xinyue Zhang, and Zhengqian Luo. “Creating A Large-Scale Financial News Corpus for Relation Extraction”. In: 2020 3rd International Conference on Artificial Intelligence and Big Data (ICAIBD) (2020), pp. 259–263. [129] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, D Rosenberg, and Gideon Mann. “BloombergGPT: A Large Language Model for Finance”. In: ArXiv abs/2303.17564 (2023). [130] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. “Training and Inference with Integers in Deep Neural Networks”. In: International Conference on Learning Representations. 2018. url: https://openreview.net/forum?id=HJGXzmspb. [131] Rui Xing, Jie Luo, and Tengwei Song. “BioRel: towards large-scale biomedical relation extraction”. In: BMC Bioinformatics 21.16 (2020), p. 543. doi: 10.1186/s12859-020-03889-5. [132] Heng-Da Xu, Xian-Ling Mao, Puhai Yang, Fanshu Sun, and Heyan Huang. “Rethinking Task-Oriented Dialogue Systems: From Complex Modularity to Zero-Shot Autonomous Agent”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 2748–2763. 
url: https://aclanthology.org/2024.acl-long.152. [133] Yahoo Finance. https://finance.yahoo.com. 2023. [134] Xiangli Yang, Zixing Song, Irwin King, and Zenglin Xu. “A Survey on Deep Semi-Supervised Learning”. In: IEEE Trans. Knowl. Data Eng. 35.9 (2023), pp. 8934–8954. doi: 10.1109/TKDE.2022.3220219. [135] Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. “DocRED: A Large-Scale Document-Level Relation Extraction Dataset”. In: ACL:2019:1. Ed. by Anna Korhonen, David Traum, and Lluís Màrquez. Florence, Italy: acl, July 2019, pp. 764–777. doi: 10.18653/v1/P19-1074. [136] Dmitry Yarotsky. “Error bounds for approximations with deep ReLU networks”. In: Neural Networks 94 (2017), pp. 103–114. doi: 10.1016/J.NEUNET.2017.07.002. [137] Deming Ye, Yankai Lin, Peng Li, and Maosong Sun. “Packed Levitated Marker for Entity and Relation Extraction”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. Ed. by Smaranda Muresan, Preslav Nakov, and Aline Villavicencio. Association for Computational Linguistics, 2022, pp. 4904–4917. doi: 10.18653/V1/2022.ACL-LONG.337. [138] Klim Zaporojets, Johannes Deleu, Chris Develder, and Thomas Demeester. “DWIE: An entity-centric dataset for multi-task document-level information extraction”. In: Inf. Process. Manag. 58.4 (2021), p. 102563. doi: 10.1016/J.IPM.2021.102563. [139] Dingwen Zhang, Le Yang, Deyu Meng, Dong Xu, and Junwei Han. “SPFTN: A Self-Paced Fine-Tuning Network for Segmenting Objects in Weakly Labelled Videos”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 5340–5348. doi: 10.1109/CVPR.2017.567. [140] Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. “Position-aware Attention and Supervised Data Improve Slot Filling”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017). 2017, pp. 35–45. url: https://nlp.stanford.edu/pubs/zhang2017tacred.pdf. [141] Jiawei Zhao, Steve Dai, Rangharajan Venkatesan, Brian Zimmer, Mustafa Ali, Ming-Yu Liu, Brucek Khailany, William J. Dally, and Anima Anandkumar. “LNS-Madam: Low-Precision Training in Logarithmic Number System Using Multiplicative Weight Update”. In: IEEE Transactions on Computers 71.12 (2022), pp. 3179–3190. doi: 10.1109/TC.2022.3202747.

Appendix A
Max-Sum Networks Theorem Proofs

A.1 Proof of Lemma 1

Proof of Lemma 1. Let N(x) be a neural network that uniformly approximates h(x) : [0, 1] → R to ϵ accuracy for any ϵ > 0. For the sake of contradiction, suppose that there is a set of nonzero measure I such that h′(x) > b a.e. for x ∈ I. Without any loss of generality, we can assume I = [c, d] is a closed interval. It follows from the condition h′(x) > b that

$$h(d) = h(c) + b(d - c) + \delta \qquad (A.1)$$

for some constant δ > 0. Since N(x) is at most ϵ away from h(x), it follows that N(c) < h(c) + ϵ. Thus, from the continuity of N(x) as well as the condition N′(x) ≤ b,

$$N(d) < h(c) + \epsilon + (d - c)b. \qquad (A.2)$$

Thus:

$$h(d) - N(d) > \delta - \epsilon. \qquad (A.3)$$

When ϵ < δ/2, we have h(d) − N(d) > δ/2 > ϵ, which contradicts the fact that N(x) uniformly approximates h(x) to ϵ accuracy. Similarly, it can be shown that h′(x) ≥ a a.e.

Proposition 1. A single-input, single-output sum-product network with ReLU activation has its second derivative equal to 0 a.e.

Proof. Let σ(x) = max(0, x) be the ReLU function and N_d(x) be the resulting d-hidden-layer network.
Proceed by induction on the number of hidden layers. For a one-hidden-layer network with width n, we have

$$N_1(x) = \sum_{i=1}^{n} w^{[o]}_i \, \sigma\!\left(w^{[1]}_i x\right) \qquad (A.4)$$

so that

$$N_1''(x) = \sum_{i=1}^{n} w^{[o]}_i \left(w^{[1]}_i\right)^2 \sigma''\!\left(w^{[1]}_i x\right) \qquad (A.5)$$

$$= 0 \qquad (A.6)$$

where the last equality holds a.e. since σ′′(x) = 0 except when x = 0. For the inductive step, note that if the final hidden layer has width n, then

$$N_d(x) = \sum_{i=1}^{n} w^{[o]}_i \, \sigma\!\left(g_i(x)\right) \qquad (A.7)$$

where the $w^{[o]}_i$ are the weights at the output layer, and g_1(x), ..., g_n(x) can be thought of as sub-networks of N_d(x) with d − 1 hidden layers, and thus by the inductive hypothesis satisfy g_i′′(x) = 0 a.e. The next two derivatives of this network are

$$N_d'(x) = \sum_{i=1}^{n} w^{[o]}_i \, \sigma'\!\left(g_i(x)\right) g_i'(x) \qquad (A.8)$$

$$N_d''(x) = \sum_{i=1}^{n} w^{[o]}_i \left( \sigma''\!\left(g_i(x)\right) g_i'(x)^2 + \sigma'\!\left(g_i(x)\right) g_i''(x) \right) \qquad (A.9)$$

The first term in the summation is 0 except when g_i(x) = 0, which happens finitely many times, and thus this term is equal to 0 a.e. The second term in the summation is 0 a.e. by the inductive hypothesis. Thus, N_d′′(x) = 0 a.e., completing our induction.

A.2 Proof of Theorem 1

To prove Theorem 1, the following lemma is needed.

Lemma 2. For any constants a_1, a_2, b_1, b_2, there exist w_0, w_1 such that

$$(a_1 + (b_1 \vee x)) \vee (a_2 + (b_2 \vee x)) = w_0 \vee (w_1 + x).$$

Proof. We first note that $(a_1 + (b_1 \vee x)) \vee (a_2 + (b_2 \vee x)) = a_1 + \big((b_1 \vee x) \vee (a_2 - a_1 + (b_2 \vee x))\big) = a_1 + \big((b_1 \vee x) \vee (c + (b_2 \vee x))\big)$, where $c = a_2 - a_1$. Set

$$g(x) = a_1 + \big((b_1 \vee x) \vee (c + (b_2 \vee x))\big). \qquad (A.10)$$

We hope to show that g(x) = w_0 ∨ (w_1 + x) for some w_0, w_1 which will depend on a_1, b_1, b_2, c. We now have the following cases based on these values.

Case 1: b_1 > b_2, c < 0. When x ≤ b_2, we have g(x) = a_1 + b_1 ∨ (b_2 + c) = a_1 + b_1. When x ∈ (b_2, b_1), we have g(x) = a_1 + b_1 ∨ (x + c) = a_1 + b_1. Finally, when x ≥ b_1, we have g(x) = a_1 + x ∨ (x + c) = a_1 + x. Thus we can set w_0 = a_1 + b_1 and w_1 = a_1, completing this case.

Case 2: b_1 > b_2, c > 0. When x ≤ b_2, we have g(x) = a_1 + b_1 ∨ (b_2 + c). When x ∈ (b_2, b_1), we have g(x) = a_1 + b_1 ∨ (x + c). Finally, when x ≥ b_1, we have g(x) = a_1 + x ∨ (x + c) = x + c + a_1. Thus we can set w_0 = a_1 + b_1 ∨ (b_2 + c) and w_1 = c + a_1, completing this case.

(When c = 0, g(x) = a_1 + ((b_1 ∨ b_2) ∨ x) = (a_1 + (b_1 ∨ b_2)) ∨ (a_1 + x), which is already of the desired form.) Since $g(x) = a_1 + \big((b_1 \vee x) \vee (c + (b_2 \vee x))\big) = a_1 + c + \big((-c + (b_1 \vee x)) \vee (b_2 \vee x)\big)$, the lemma also holds when b_2 ≥ b_1. This completes all the cases.

Proof of Theorem 1. We can prove this by induction on the number of hidden layers. Consider a one-hidden-layer network with n computation units. The resulting function is of the form

$$N_1(x) = b^{[o]} \vee \bigvee_{i=1}^{n} \left( w^{[o]}_i + b^{[1]}_i \vee \big(w^{[1]}_i + x\big) \right) \qquad (A.11)$$

for some real-valued constants $b^{[o]}, w^{[o]}_i, w^{[1]}_i, b^{[1]}_i$ for i ∈ [1, n]. First note that b ∨ (w + x) = w + (b − w) ∨ x. Then

$$N_1(x) = b^{[o]} \vee \bigvee_{i=1}^{n} \left( w^{[o]}_i + w^{[1]}_i + \big(b^{[1]}_i - w^{[1]}_i\big) \vee x \right) \qquad (A.12)$$

We can now apply Lemma 2 to equation (A.12) repeatedly to complete our base case. For the inductive step, suppose that the theorem is true for d − 1 hidden layers. For a network with d layers, suppose that the d-th hidden layer has n computation units. Denote the output of each computation unit in the d-th hidden layer as g_i(x) for i ∈ [1, n]. Now g_i(x) can be thought of as a neural network with d − 1 hidden layers. It is clear from these definitions that

$$N_d(x) = b^{[o]} \vee \bigvee_{i=1}^{n} \left( w^{[o]}_i + g_i(x) \right) \qquad (A.13)$$

for some real-valued constants $b^{[o]}, w^{[o]}_i$ for i ∈ [1, n]. By the inductive hypothesis, each g_i(x) = w_i ∨ (w_i′ + x) for some values w_i, w_i′. As with the base case, Lemma 2 can be repeatedly applied to equation (A.13) to complete the inductive step.
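The collapse established by Theorem 1 is easy to sanity-check numerically. The sketch below is illustrative only: it assumes one concrete reading of the max-sum computation unit (output = bias ∨ max_j(weight_j + input_j), mirroring the structure of (A.11) and (A.13), with interior units taking all previous-layer outputs as one natural multi-input generalization), builds a random three-hidden-layer network, recovers w_0 and w_1 from the two tails, and checks agreement with w_0 ∨ (w_1 + x) on a grid.

import numpy as np

rng = np.random.default_rng(0)

def max_sum_unit(inputs, w, b):
    # One assumed max-sum computation unit: b ∨ max_j (w_j + input_j).
    return max(float(b), float(np.max(w + inputs)))

def max_sum_net(x, layers, w_out, b_out):
    h = np.array([x], dtype=float)
    for W, B in layers:                       # W: (width, fan_in), B: (width,)
        h = np.array([max_sum_unit(h, W[j], B[j]) for j in range(W.shape[0])])
    return max_sum_unit(h, w_out, b_out)      # output unit has the same form

# A random three-hidden-layer, width-4 max-sum network.
widths = [1, 4, 4, 4]
layers = [(rng.normal(size=(widths[i + 1], widths[i])), rng.normal(size=widths[i + 1]))
          for i in range(len(widths) - 1)]
w_out, b_out = rng.normal(size=widths[-1]), rng.normal()

# Theorem 1: the whole network collapses to w0 ∨ (w1 + x).
# Read w0 off the flat left tail and w1 off the unit-slope right tail.
x_far = 1e6
w0 = max_sum_net(-x_far, layers, w_out, b_out)
w1 = max_sum_net(+x_far, layers, w_out, b_out) - x_far

xs = np.linspace(-50.0, 50.0, 2001)
err = max(abs(max_sum_net(x, layers, w_out, b_out) - max(w0, w1 + x)) for x in xs)
print(f"max deviation from w0 ∨ (w1 + x): {err:.1e}")

Any deviation reported here is at floating-point precision, consistent with the exact algebraic collapse proved above.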
A.3 Proof of Theorem 2

To prove Theorem 2, we first need the following lemma, analogous to Lemma 2.

Lemma 3. Consider the class of functions

$$F := \{\, c_1,\; c_2 - x,\; c_3 + x,\; c_1 \vee (c_2 - x),\; c_1 \vee (c_3 + x),\; (c_2 - x) \vee (c_3 + x),\; c_1 \vee (c_2 - x) \vee (c_3 + x) \mid c_1, c_2, c_3 \in \mathbb{R} \,\}. \qquad (A.14)$$

Then ∀ f, g ∈ F, we have f ∨ g ∈ F.

Proof. This is trivial to verify except when f or g takes on the form (c_2 − x) ∨ (c_3 + x) or c_1 ∨ (c_2 − x) ∨ (c_3 + x). To address this case, let σ_{a_1,a_2,a_3}(x) = (a_1 − x) ∨ a_2 ∨ (a_3 + x). Then it is sufficient to verify the following properties:

1. σ_{a_1,a_2,a_3}(x) ∨ (b − x) = σ_{c_1,c_2,c_3}(x) for some c_1, c_2, c_3 ∈ R.
2. σ_{a_1,a_2,a_3}(x) ∨ (b + x) = σ_{c_1,c_2,c_3}(x) for some c_1, c_2, c_3 ∈ R.
3. σ_{a_1,a_2,a_3}(x) ∨ b = σ_{c_1,c_2,c_3}(x) for some c_1, c_2, c_3 ∈ R.
4. σ_{a_1,a_2,a_3}(x) ∨ σ_{b_1,b_2,b_3}(x) = σ_{c_1,c_2,c_3}(x) for some constants c_1, c_2, and c_3.

Property (1) can be proved as follows:

$$\sigma_{a_1,a_2,a_3}(x) \vee (b - x) = \big((a_1 - x) \vee (b - x)\big) \vee a_2 \vee (a_3 + x) = \big((a_1 \vee b) - x\big) \vee a_2 \vee (a_3 + x) = \sigma_{a_1 \vee b,\, a_2,\, a_3}(x). \qquad (A.15)$$

Properties (2) and (3) can be treated similarly. Property (4) follows from basic properties of a ∨ b:

$$\sigma_{a_1,a_2,a_3}(x) \vee \sigma_{b_1,b_2,b_3}(x) = \big((a_1 - x) \vee (b_1 - x)\big) \vee (a_2 \vee b_2) \vee \big((a_3 + x) \vee (b_3 + x)\big). \qquad (A.16)$$

Thus, letting c_1 = a_1 ∨ b_1, c_2 = a_2 ∨ b_2, c_3 = a_3 ∨ b_3 gives us the desired result.

With this lemma we can now prove Theorem 2.

Proof of Theorem 2. Since all the functions in F as given in Lemma 3 have their derivatives in the set {−1, 0, 1} a.e., it is sufficient to use induction on the number of hidden layers to prove that the neural networks given by the theorem statement belong to F. Consider a single-input, single-output, one-hidden-layer network with n computation units. The resulting function is of the form

$$N_1(x) = b^{[o]} \vee \bigvee_{i=1}^{n} a^{[o]}_i \left( w^{[o]}_i + b^{[1]}_i \vee a^{[1]}_i \big(w^{[1]}_i + x\big) \right) \qquad (A.17)$$

for some real-valued constants $b^{[o]}, b^{[1]}_i, a^{[o]}_i, a^{[1]}_i, w^{[o]}_i, w^{[1]}_i$ for i ∈ [1, n]. Since $a^{[1]}_i (w^{[1]}_i + x)$ and $w^{[o]}_i$ belong to F as in Lemma 3, note that $w^{[o]}_i + b^{[1]}_i \vee a^{[1]}_i (w^{[1]}_i + x) \in F$, from which it is clear that $a^{[o]}_i \big( w^{[o]}_i + b^{[1]}_i \vee a^{[1]}_i (w^{[1]}_i + x) \big) \in F$ as well. Thus, N_1(x) ∈ F, completing the base case.

For the inductive step, suppose that the theorem is true for d − 1 hidden layers. For a network with d layers, suppose the d-th hidden layer has n computation units. Denote the output of each computation unit in the d-th hidden layer as g_i(x) for i ∈ [1, n]. Now g_i(x) is effectively a neural network with d − 1 hidden layers. It is clear from these definitions that

$$N_d(x) = b^{[o]} \vee \bigvee_{i=1}^{n} a^{[o]}_i \left( w^{[o]}_i + g_i(x) \right) \qquad (A.18)$$

for some real-valued constants $b^{[o]}, a^{[o]}_i, w^{[o]}_i$ for i ∈ [1, n]. From the inductive hypothesis, g_i(x) ∈ F, from which a repeated application of Lemma 3 shows that N_d(x) ∈ F as well, completing our induction.

A.4 Proof of Theorem 3

Proof of Theorem 3. It is easy to see that N_d′(x) is undefined only where σ′ is undefined. Thus there are only finitely many points at which N_d′(x) is undefined. For the values of x at which N_d′(x) is defined, we shall show that 0 ≤ N_d′(x) < 1 using induction on the number of hidden layers. Consider a single-input, single-output, one-hidden-layer network with n computation units. The resulting function is of the form

$$N_1(x) = \ln\!\left[ e^{b^{[o]}} + \sum_{i=1}^{n} e^{\, w^{[o]}_i + \sigma\left( \ln\left( e^{b^{[1]}_i} + e^{w^{[1]}_i + x} \right) \right)} \right] \qquad (A.19)$$

for some real-valued constants $b^{[o]}, w^{[o]}_i, b^{[1]}_i, w^{[1]}_i$ for i ∈ [1, n].
From equation (A.19) it follows that

$$N_1'(x) = \frac{\alpha(x)}{\beta(x)} \qquad (A.20)$$

where

$$\alpha(x) = \sum_{i=1}^{n} \left[ e^{\, w^{[o]}_i + \sigma(M_i(x))}\, \sigma'(M_i(x))\, \frac{e^{w^{[1]}_i + x}}{e^{b^{[1]}_i} + e^{w^{[1]}_i + x}} \right] \qquad (A.21)$$

$$\beta(x) = e^{b^{[o]}} + \sum_{i=1}^{n} e^{\, w^{[o]}_i + \sigma(M_i(x))} \qquad (A.22)$$

$$M_i(x) = \ln\!\left( e^{b^{[1]}_i} + e^{w^{[1]}_i + x} \right) \qquad (A.23)$$

Since σ′(x) ∈ [0, 1], it is immediately clear that N_1′(x) ≥ 0. N_1′(x) < 1 follows from the inequalities

$$e^{w^{[1]}_i + x} < e^{b^{[1]}_i} + e^{w^{[1]}_i + x} \qquad (A.24)$$

and

$$\sum_{i=1}^{n} \left[ e^{\, w^{[o]}_i + \sigma(M_i(x))}\, \sigma'(M_i(x)) \right] < e^{b^{[o]}} + \sum_{i=1}^{n} e^{\, w^{[o]}_i + \sigma(M_i(x))}, \qquad (A.25)$$

completing the base case. For the inductive step, suppose the theorem is true for d − 1 hidden layers. Consider a network with d hidden layers where the d-th hidden layer has n computation units. Denote the output of each computation unit in the d-th hidden layer as g_i(x) for i ∈ [1, n]. Now g_i(x) is effectively a neural network with d − 1 hidden layers. Since there is no activation at the output layer, N_d(x) has the following form

$$N_d(x) = \ln\!\left[ e^{b^{[o]}} + \sum_{i=1}^{n} e^{\, w^{[o]}_i + \sigma(g_i(x))} \right] \qquad (A.26)$$

for some real-valued constants $b^{[o]}, w^{[o]}_i$ for i ∈ [1, n]. From equation (A.26) it follows that

$$N_d'(x) = \frac{\sum_{i=1}^{n} e^{\, w^{[o]}_i + \sigma(g_i(x))}\, \sigma'(g_i(x))\, g_i'(x)}{e^{b^{[o]}} + \sum_{i=1}^{n} e^{\, w^{[o]}_i + \sigma(g_i(x))}} \qquad (A.27)$$

From the inductive hypothesis, g_i′(x) ≥ 0. Therefore N_d′(x) ≥ 0. N_d′(x) < 1 follows from the following inequality, which is trivially true since g_i′(x) < 1:

$$e^{\, w^{[o]}_i + \sigma(g_i(x))}\, \sigma'(g_i(x))\, g_i'(x) < e^{\, w^{[o]}_i + \sigma(g_i(x))}. \qquad (A.28)$$

This completes the proof.

Appendix B
FIRE Dataset Statistics and Experimental Details

B.1 Distribution of entity and relation types in FIRE

Table B.1 breaks down the quantity of each entity type in the dataset while Table B.2 displays the same information but for relations. For a detailed description of each entity and relation type, see the annotation guidelines document accompanying the dataset.

Number of Entity Mentions                  15,334
Average number of entities per instance    5.29

Amount of each entity:
  Company               22.41%
  FinancialEntity       15.60%
  Date                  15.37%
  Designation            8.08%
  Money                  7.78%
  Action                 5.57%
  Quantity               5.27%
  Product                4.39%
  Sector                 3.90%
  Location               3.74%
  Person                 3.41%
  BusinessUnit           2.71%
  GeopoliticalEntity     1.70%

Table B.1: FIRE Dataset Entity Statistics

Number of Relation Mentions                 8,366
Average number of relations per instance    2.92

Amount of each relation:
  Valuein                  11.17%
  Value                     9.98%
  Designation               9.95%
  Actionto                  8.55%
  Actionin                  6.35%
  Propertyof                6.33%
  Locatedin                 6.06%
  Sector                    5.76%
  Productof                 5.71%
  Constituentof             5.27%
  Employeeof                4.67%
  ValueChangeIncreaseby     4.31%
  ActionBuy                 3.87%
  ValueChangeDecreaseby     3.64%
  Subsidiaryof              3.16%
  Quantity                  3.08%
  ActionSell                1.66%
  ActionMerge               0.40%

Table B.2: FIRE Dataset Relation Statistics

B.2 Hyper-parameter Selection

Model        Learning Rate (NER)   Batch Size (NER)   Learning Rate (RE)   Batch Size (RE)
SpERT        —                     —                  5e-5                 2
PL-Marker    7e-5                  2                  4e-6                 2
REBEL        —                     —                  3e-6                 4

Table B.3: Selected hyper-parameters for standard fine-tuning. Note that PL-Marker has a separate training run for its NER module; therefore, we search for the learning rate and batch size of this module as well.

Model        LR (Tier 1-3)   LR (Tier 4-6)   LR (Tier 7-9)   LR (Final)   Batch Size
SpERT        8e-6            5e-5            3e-5            5e-5         8
PL-Marker    7e-6            4e-5            4e-5            1e-6         4
REBEL        5e-6            4e-5            3e-5            1e-6         4

Table B.4: Hyper-parameters for curriculum learning experiments. Note that for PL-Marker, we apply curriculum learning on the RE module only. For the NER module, we fix the learning rate to 5e-5 and the batch size to 4.
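As a concrete illustration of how the tiered values in Table B.4 could be consumed, here is a minimal, hypothetical sketch of a staged training loop. The function names, the cumulative-tier assumption, and the loop structure are ours and are not taken from the project's training scripts; only the learning rates, batch size, and tier grouping come from the SpERT row of Table B.4.

def tier_learning_rate(tier: int) -> float:
    # Fixed learning rate per tier group, as described in Section B.2 (SpERT row).
    if 1 <= tier <= 3:
        return 8e-6
    if 4 <= tier <= 6:
        return 5e-5
    if 7 <= tier <= 9:
        return 3e-5
    raise ValueError("tier must be between 1 and 9")

FINAL_LR = 5e-5   # final pass over the whole dataset (SpERT row)
BATCH_SIZE = 8    # SpERT row of Table B.4

def curriculum_train(train_one_stage, tiered_data, full_dataset):
    # `train_one_stage(data, lr, batch_size)` is a stand-in for one training pass;
    # tiers are assumed here to be added cumulatively, easiest tier first.
    seen = []
    for tier in range(1, 10):
        seen.extend(tiered_data[tier])
        train_one_stage(seen, tier_learning_rate(tier), BATCH_SIZE)
    train_one_stage(full_dataset, FINAL_LR, BATCH_SIZE)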
For our experiments, we allocated a tuning budget of 24 hours on an Nvidia GeForce RTX 2080 Ti GPU for each RE-specialized model to search for the optimal hyper-parameters on the validation set. Table B.3 displays the selected hyper-parameters for SpERT, PL-Marker and REBEL in the standard fine-tuning experiments.

Table B.4 presents the hyper-parameters for the curriculum learning experiments for the RE-specialized models. To reduce the search space, instead of searching for one learning rate for each data tier, we select a fixed learning rate for tiers 1 to 3, 4 to 6 and 7 to 9. Thus we search for only three learning rates across all tiers, in addition to the final learning rate for training on the whole dataset.

B.3 Llama 2-7b and GPT 3.5 Prompts

B.3.1 Few-Shot Learning Prompts

For few-shot learning, the following 1-shot prompt was used:

Find the relation between the entities given in the context and produce a list of triplets containing two entities and their relations. Only find out the following relations ActionBuy, Actionin, ActionSell, ActionMerge, Actionto, Constituentof, Designation, Employeeof, Locatedin, Productof, Propertyof, Quantity, Sector, Subsidiaryof, Value, ValueChangeDecreaseby, ValueChangeIncreaseby and Valuein

ActionMerge indicate two company or organizations enters into merger agreements to form a single entity.

ActionBuy represents the action of purchasing/acquiring a Company, FinancialEntity, Product, or BusinessUnit by a Company or a Person.

Actionto represents the relation between the action entity and the entity on which the action has taken.

Constituentof relation denotes one financial entity is part of another financial entity.

Actionin indicates the Date associated with an Action entity, signifying the time of occurrence of the action.

ActionSell represents the action of selling a Company, FinancialEntity, Product, or BusinessUnit by a Company or a Person.

Employeeof denotes the past, present or future employment relationship between a Person and a Company.

Designation indicates the job title or position of a Person, or the Designation of a Company in the financial context, providing information about the role or responsibility of the entity.

Locatedin indicates the geographical location or country associated with an entity, specifying the place or region where the entity is located. Money and Quantity can be in the place where they were generated, lost, profited, etc. Note that a Company is only Located in a place if it based in that place.

Productof indicates a Product is manufactured, sold, offered, or marketed by a Company, establishing a relationship between the Company and the Product.

Propertyof serves as an “umbrella relation” that indicates a general association between two entities, mainly representing ownership or part-of/composition relationships. This relation is used to connect two entities when a more specific relation is not yet defined.

Quantity represents the countable quantity a FinancialEntity, BusinessUnit or Product.

Sector indicates the economic sector or industry to which a Company belongs, providing information about the broad business area or category of the Company’s operations.

Subsidiaryof indicates that a Company is a subsidiary of a parent Company, either wholly or majority owned. Note that “brands” are always considered subsidiaries of their parent Company. A highly occurring pattern is a parent company selling its subsidiary company, in which case the Subsidiaryof relation is not annotated.
Value represents a non-countable value of a FinancialEntity, BusinessUnit or Product, such as a monetary value or a percentage. A Company can also have a Value relation, but only for monetary values such as the net worth of a company or the sale price in an acquisition.

ValueChangeDecreaseby indicates the decrease in monetary value or quantity of a FinancialEntity. An additional, rarer use-case is the Quantity of a BusinessUnit decreasing, such as the number of employees or number of offices.

ValueChangeIncreaseby indicates the increase in value or quantity of a FinancialEntity. An additional, rarer use-case is the Quantity of a BusinessUnit increasing, such as the number of employees or number of offices.

Valuein indicates the Date associated with a Money or Quantity entity, providing information about the specific time period to which the Money or Quantity value is related.

Please find a few examples below.

Context: Bank of America to Buy Merrill Lynch for $50 Billion
Answer: [['Bank of America', 'Merrill Lynch', 'ActionBuy'], ['Buy', 'Merrill Lynch', 'Actionto'], ['Merrill Lynch', '$50 Billion', 'Value']]

B.3.2 Fine-Tuning Prompts

For fine-tuning, the dataset examples were transformed into the following prompt, which was used to train the models:

Question: Find the relation between the entities given in the context and produce a list of triplets containing two entities and their relations. Only find out the following relations: ActionBuy, Actionin, ActionSell, ActionMerge, Actionto, Constituentof, Designation, Employeeof, Locatedin, Productof, Propertyof, Quantity, Sector, Subsidiaryof, Value, ValueChangeDecreaseby, ValueChangeIncreaseby, and Valuein.
Context: Bank of America to Buy Merrill Lynch for $50 Billion
Answer: [['Bank of America', 'Merrill Lynch', 'ActionBuy'], ['Buy', 'Merrill Lynch', 'Actionto'], ['Merrill Lynch', '$50 Billion', 'Value']]

Appendix C
FIRE Dataset Annotation Guidelines

This document contains the annotation guidelines for the FIRE (FInancial Relation Extraction) dataset. The dataset can be found at https://github.com/**. The main objective of this document is to provide a comprehensive manual that outlines the standards and practices for annotating entities and relations in the FIRE dataset. As such, it is pivotal to understanding and maintaining the consistency and quality of annotations in the dataset.

The document is divided into six key sections. In Section 1, we provide a summary list of all entity types and relation types included in the FIRE dataset. Sections 2 and 3 offer definitions and basic examples of the entities and relations found in the dataset, respectively. In Sections 4 and 5, we delve into more intricate use cases, covering edge cases and guidelines for conflict resolution when labeling entities and relations, respectively. Lastly, in Section 6, we present a set of general entity and term annotation guidelines, some of which are based on the ACL RD-TEC guidelines ∗ .
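As a concrete, purely illustrative picture of what an annotated FIRE instance contains, the sketch below encodes the 1-shot example from Section B.3 using the 13 entity types and 18 relation types summarized in Section C.1. The Python representation (the field names "text", "entities" and "relations", and the validate helper) is our own illustration and is not the dataset's release format; only the type inventories and the example triplets come from this document.

# Illustrative sketch only: an in-memory view of one annotated instance.
# The dict layout and helper below are hypothetical, not the FIRE file format;
# the entity and relation inventories are the ones defined in this appendix.

ENTITY_TYPES = {
    "Action", "BusinessUnit", "Company", "Date", "Designation",
    "FinancialEntity", "GeopoliticalEntity", "Location", "Money",
    "Person", "Product", "Quantity", "Sector",
}

RELATION_TYPES = {
    "ActionBuy", "Actionin", "ActionMerge", "ActionSell", "Actionto",
    "Constituentof", "Designation", "Employeeof", "Locatedin", "Productof",
    "Propertyof", "Quantity", "Sector", "Subsidiaryof", "Value",
    "ValueChangeDecreaseby", "ValueChangeIncreaseby", "Valuein",
}

# The 1-shot example from Section B.3, with each mention typed according to
# the definitions in Section C.2 and relations expressed as
# (head mention, tail mention, relation type) triplets.
instance = {
    "text": "Bank of America to Buy Merrill Lynch for $50 Billion",
    "entities": [
        ("Bank of America", "Company"),
        ("Buy", "Action"),
        ("Merrill Lynch", "Company"),
        ("$50 Billion", "Money"),
    ],
    "relations": [
        ("Bank of America", "Merrill Lynch", "ActionBuy"),
        ("Buy", "Merrill Lynch", "Actionto"),
        ("Merrill Lynch", "$50 Billion", "Value"),
    ],
}


def validate(instance: dict) -> bool:
    """Check that every label belongs to the FIRE type inventories and that
    every relation argument is an annotated entity mention."""
    mentions = {m for m, _ in instance["entities"]}
    ok_entities = all(t in ENTITY_TYPES for _, t in instance["entities"])
    ok_relations = all(
        h in mentions and t in mentions and r in RELATION_TYPES
        for h, t, r in instance["relations"]
    )
    return ok_entities and ok_relations


assert validate(instance)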
∗https://aclanthology.org/L16-1294/ 118 C.1 Summary List of Named Entities and Relations 13 Entity Types & 18 Relation Types Entity Type Action BusinessUnit Company Date Designation FinancialEntity GeopoliticalEntity Location Money Person Product Quantity Sector Relation Type ActionBuy Actionin ActionMerge ActionSell Actionto Constituentof Designation Employeeof Locatedin Productof Propertyof Quantity Sector Subsidiaryof Value ValueChangeDecreaseby Valuein ValueChangeIncreaseby 119 C.2 Definition of Named Entities Entity 1: Action (AC) ⋆ Definition: Refers to any activity or event that signifies a transaction or occurrence in the financial market, such as company acquisitions, share purchases, mergers, initial public offerings (IPOs), and other similar events. ⋆ Example entities: {acquire; sell; buy; co-founded; acquisition; IPO; merger; spin off; separated; launch; released; invest; takeover} ⋆ Example Sentence Mention: “Apple is reported to acquire AI startup Voxun for $10 million.” Entity 2: BussinessUnit (BU) ⋆ Definition: Represents a distinct unit, division or department within a company, excluding subsidiaries, and can include facilities, stores, or company assets, as well as employees associated with that unit or department. ⋆ Example entities: {semiconductor division, Staples center, marketing department, Supermarket, manufacturing plants, etc.} ⋆ Example Sentence Mention: “The Data&AI team in Microsoft’s Cloud and AI division is a group of passionate scientists and engineer” Entity 3: Company (CO) ⋆ Definition: Denotes the official or unofficial name of a registered company or a brand. ⋆ Example entities: {Apple Inc.; Uber; Bank of America} ⋆ Example Sentence Mention: “Walmart’s headquarters is located in Bentonville, Arizona.” 120 Entity 4: Date (DA) ⋆ Definition: Represents a specific time period, whether explicitly mentioned (e.g., “year ended March 2020”) or implicitly referred to (e.g., “last month”), in the past, present, or future. ⋆ Example entities: {June 2nd, 2010; quarter ended 2021; last week; prior year; Wednesday} ⋆ Example Sentence Mention: ”Revenue increased by $3 million and $2 million in fiscal years 2016 and 2015, respectively.” Entity 5: Designation (DE) ⋆ Definition: Refers to an official position held within a company, such as CEO, a job description, such as analyst, or any role played in a financial transaction, including customers, sellers, buyers, third parties, trustees, and other similar roles. (Note: Employees are categorized as BusinessUnit, and companies may have Designations, e.g., a company is a Trustee) ⋆ Example entities: {CEO; President; Board of Directors; employees; analyst; physician; Professor; customer; seller; stockholder; trustee; third party} ⋆ Example Sentence Mention: “Information about our directors and officers is incorporated by reference from the sections proposal one.” Entity 6: FinancialEntity (FE) ⋆ Definition: Represents an entity that possesses a monetary, countable, or quantifiable value within the context of finance, including assets, liabilities, investments, and other similar entities. ⋆ Example entities: {Stock; revenue; profit; stake; taxes; loss; grant date fair value; amount outstanding; assets } 121 ⋆ Example Sentence Mention: “The growth in revenue was primarily the result of increased sales of our graphics processor.” Entity 7: GeopoliticalEntity (GE) ⋆ Definition: Denotes countries or political regions, even when used as adjectives. 
⋆ Example entities: {U.S.; Japanese; China; European Union; German} ⋆ Example Sentence Mention: “Finnish Aldata Solution has signed a contract of supply its G.O.L.D system to retail chains located in France.” Entity 8: Location (LO) ⋆ Definition: Represents geographical locations that are not political regions/countries, such as regions, states, cities, roads, or any other location. ⋆ Example entities: {California; Paris; 1280 W 12th Blvd; Americas; Europe} ⋆ Example Sentence Mention: “Elcoteq Dongguan was established in 1999 in Dongguan, China and is one of Elcoteq’s four volume manufacturing plants in the Asia - Pacific region.” Entity 9: Money (MO) ⋆ Definition: Denotes a monetary value expressed in any world currency, including digital currencies. ⋆ Example entities: {$76.3 million; $4 Bn; Rs 33.80 crore; 1.2 BTC} ⋆ Example Sentence Mention: “Revenue increased by $3 million and $2 million in fiscal years 2016 and 2015, respectively.” 122 Entity 10: Person (PE) ⋆ Definition: Represents the name of an individual. ⋆ Example entities: {Meg Whitman; Mr. Baker; Warren Buffet} ⋆ Example Sentence Mention: “Sharmistha Dubey is the 51-year-old CEO of the Match Group.” Entity 11: Product (PR) ⋆ Definition: Refers to any physical object or service manufactured or provided by a company to consumers, excluding references to businesses or sectors within the financial context (which would labeled as Sector ). ⋆ Example entities: {iPhone; Tesla model X; cloud services; Microsoft Windows 10; laptops; medical equipment; computer software; online classes; eye surgery} ⋆ Example Sentence Mention: “Fitbit Inc. designs wearable fitness trackers and smartwatches, while also providing personalized fitness tracking services.” Entity 12: Quantity (QU) ⋆ Definition: Represents any numeric value that is not categorized as Money, such as percentages, numbers, measurements (e.g., weight, length), or other similar quantities. Note that unit of measurements are also part of the entity. ⋆ Example entities: {15 %; 25,000 units; 2.75in; 100 tons} ⋆ Example Sentence Mention: “Profit increased by 15% compared to last quarter, primarily driven by the selling of 5,000 more laptops.” 123 Entity 13: Sector (SE) ⋆ Definition: Denotes any economic area where businesses compete for products, activities, or services, or any area explicitly stated as a business, segment, or sector. Also, more generic entities such as manufacturer, designer, developer are considered Sector entities as well. ⋆ Example entities: {Tech; Healthcare; Automotive; Manufacturer; Retail; Supermarket chains; Software startup; Phone business;} ⋆ Example Sentence Mention: “Drugmaker Sanofi SA said on Tuesday it will buy Amunix Pharmaceuticals Inc.” 124 C.3 Definition of Relations Relation 1: ActionBuy ⋆ Definition: Represents the action of purchasing/acquiring a Company, FinancialEntity or BusinessUnit by a Company or a Person. Company Person Company FinancialEntity BusinessUnit Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Apple is reported to acquire AI startup Vocum for $10 million . (CO) (AC) (SE) (CO) (MO) ActionBuy Relation 2: Actionin ⋆ Definition: Indicates the Date associated with an action such as an acquisiton, a merger or an IPO. 
125 Action Date Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Michael Eisner Sells Trading Card Company Topps to Fanatics for $ 500 Million in June of last year . (PE) (AC) (SE) (CO) (CO) (MO) (DA) Actionin Relation 3: ActionMerge ⋆ Definition: Connects two Company entities that entered into a merger agreement. This relation is symmetric. For annotation purposes, the head entity is always the first entity to appear in text. Company Company Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Symmetra Financial announced that it entered into a merger agreement with Sumitomo Life Insurance . (CO) (AC) (CO) ActionMerge 126 Relation 4: ActionSell ⋆ Definition: Represents the action of selling a Company, FinancialEntity, or BusinessUnit by a Company or a Person. Company Person Company FinancialEntity BusinessUnit Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Michael Eisner Sells Trading Card Company Topps to Fanatics for $ 500 Million in June of last year . (PE) (AC) (SE) (CO) (CO) (MO) (DA) ActionSell Relation 5: Actionto ⋆ Definition: Connects an Action to the entity receiving the action, e.g. the entity being acquired, sold, etc. 127 Action Company FinancialEntity BusinessUnit Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Michael Eisner Sells Trading Card Company Topps to Fanatics for $ 500 Million in June of last year . (PE) (AC) (SE) (CO) (CO) (MO) (DA) Actionto Relation 6: Constituentof ⋆ Definition: Indicates a part-of/composition relationship. Serves two purposes, to label a businessunit/division being part of a company (or another businessunit) and to decompose long entities into separate components. 128 FinancialEntity BusinessUnit FinancialEntity BusinessUnit Company Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) We classify interest and expense associated with our uncertain tax positions as a component of income tax expense . (FE) (FE) Constituentof Relation 7: Designation ⋆ Definition: Indicates the job title or position of a Person, or the Designation of a Company in the financial context, providing information about the role or responsibility of the entity. (Do not confuse the Designation entity (see 5) with this Designation relation). 129 Person Company Designation Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Certification of Jeffrey P. Bezos , Chairman and Chief Executive Officer of Amazon.com , Inc. . (PE) (DE) (DE) (CO) Designation Designation Relation 8: Employeeof ⋆ Definition: Denotes the past, present or future employment relationship between a Person and a Company or a BusinessUnit inside the Company. Person Company BusinessUnit Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) 130 Certification of Jeffrey P. Bezos , Chairman and Chief Executive Officer of Amazon.com , Inc. . 
(PE) (DE) (DE) (CO) Employeeof Relation 9: Locatedin ⋆ Definition: Indicates the geographical location or country associated with an entity, specifying the place or region where the entity is located. Money and Quantity can be Located in the place where they were generated, lost, profited, etc. Note that a Company is only Located in a place if it based in that place. Company BusinessUnit Location Money Quantity Location GeopoliticalEntity Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) The Ziegenfelder Company has been sold to the Armstrong Group out of Butler , Pennsylvania . (CO) (AC) (CO) (LO) (LO) Locatedin Locatedin 131 Relation 10: Productof ⋆ Definition: Indicates that a Product is manufactured, sold, offered, or marketed by a Company, establishing a relationship between the Company and the Product. Product Company Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Polaris Inc. is an American manufacturer of motorcycles , snowmobiles , and all-terrain vehicles . (CO) (GE) (PR) (PR) (PR) Productof Productof Productof Relation 11: Propertyof ⋆ Definition: Representing the connection or ownership of a FinancialEntity. 132 FinancialEntity Company Person BusinessUnit Product Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Shares of Tesla dropped 14% over the last quarter . (FI) (CO) (QU) (DA) Propertyof 133 Relation 12: Quantity ⋆ Definition: Represents the countable quantity a FinancialEntity, BusinessUnit or Product. (For all non-countable numerical values, this is covered by the Value entity, see 15) FinancialEntity BusinessUnit Product Quantity Valid Head Entities Valid Tail Entities 134 ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Sales of Yamaha motorbikes also went up from 10,524 units to 12,193 units witnessing increase of 15 % . (FE) (CO) (PR) (QU) (QU) (QU) Quantity Quantity Relation 13: Sector ⋆ Definition: Indicates the economic sector or industry to which a Company belongs, providing information about the broad business area or category of the Company’s operations. Note: do not confuse the Sector entity (see 13) with this Sector relation. Company Sector Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) British money manager Schroders plc to buy renewable investor Greencoat Capital Holdings Limited . (GE) (SE) (CO) (AC) (SE) (CO) Sector Sector Relation 14: Subsidiaryof ⋆ Definition: Indicates that a Company is a subsidiary of a parent Company, either wholly or majority owned. Note that ”brands” are always considered subsidiaries of their parent Company. A highly occurring pattern is a parent company selling its subsidiary company, in which case the Subsidiaryof relation is not annotated (see item 2 in relation box 11) 135 Company Company Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Facebook parent Meta names DoorDash CEO Tony Xu as a director . (CO) (CO) (CO) (DE) (PE) (DE) Subsidiaryof 136 Relation 15: Value ⋆ Definition: Represents a non-countable value of a FinancialEntity, BusinessUnit or Product such as a monetary value or a percentage. 
A Company can also have a Value relation, but only for monetary values such as indicating the net worth of a company or the sale price in an acquisition. BusinessUnit Company FinancialEntity Money Quantity Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) 137 In fiscal years 2014 , 2015 revenue jumped by 15% and 13% to $3 million and $2.8 million , respectively. (DA) (DA) (FE) (QU) (QU) (MO) (MO) Value Value Relation 16: ValueChangeDecreaseby ⋆ Definition: Indicates the decrease in monetary value or quantity of a FinancialEntity. FinancialEntity Money Quantity Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) Sales of Bullet 350 dipped 23.08 % or 8,061 units as did sales of Electra 350 that fell by 7.32 % or 4,521 units . (FE) (PR) (QU) (QU) (FE) (PR) (QU) (QU) ValueChangeDecreaseby ValueChangeDecreaseby ValueChangeDecreaseby ValueChangeDecreaseby Relation 17: Valuein ⋆ Definition: Indicates the Date associated with a Money or Quantity entity, providing information about the specific time period to which the Money or Quantity value is related. 138 Money Quantity Date Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) In fiscal years 2014 , 2015 revenue jumped by 15% and 13% to $3 million and $2.8 million , respectively. (DA) (DA) (FE) (QU) (QU) (MO) (MO) Valuein Valuein Valuein Valuein Relation 18: ValueChangeIncreaseby ⋆ Definition: Indicates the increase in value or quantity of a FinancialEntity. FinancialEntity Money Quantity Valid Head Entities Valid Tail Entities ⋆ Example Sentence Mention: (Note that only the considered relation is added below for clarity) 139 In fiscal years 2014 , 2015 revenue jumped by 15% and 13% to $3 million and $2.8 million , respectively. (DA) (DA) (FE) (QU) (QU) (MO) (MO) ValueChangeIncreaseby ValueChangeIncreaseby 140 C.4 Detailed Entity Annotation Rules 141 Entity 1: Action (AC) 1. Any verb/action outside the financial market context is not an Action entity, even if it’s highlighting a new event that occurred. Note in the example below, the verbs “announced” and “entered” are generic verbs and do not constitute an Action entity, whereas “merger agreement” is indeed an event in the financial market context. Symmetra Financial announced that it entered into a merger agreement with Sumitomo Life Insurance . (CO) (AC) (CO) ActionMerge Actionto 2. Purchasing or a Product or placing an order for a Product is not an Action entity Intel sells mobile chips . (CO) (PR) Productof UPS to purchase 19 Boeing 747 Freighters (CO) (QU) (CO) (PR) Productof Quantity 3. The Action of buying/selling of FinancialEntities such as shares or ownership interest should be labeled as Action entity. Brion Inc. repurchased 3.9 million shares of the company’s common stock at an average price of $75.29 per share . (CO) (AC) (QU) (FE) (FE) ActionBuy Actionto Quantity Constituentof Value 4. Action entities can take different parts-of-speech, i.e. an Action entity is not necessarily a verb (e.g. acquisition, merger and sale are nouns). 142 Prior year benefits of $6 million in connection with the sale of the India subsidiary . (DA) (FE) (MO) (AC) (GE) Value Valuein Actionin 143 Entity 2: BusinessUnit (BU) 1. 
BusinessUnit entities are a separate department/division in a company or an asset/resource of a company, whether a human resource (e.g. employees) or a physical resource (e.g, facility/center/venue/arena or hardware systems/assets) Intel completed the acquisition of the Wireless Solutions (WLS) division of Infineon early in 2011 for US $1.4 billion . (CO) (AC) (BU) (CO) (DA) (MO) Actionto ActionBuy ActionSell Actionin Value Valuein We own two production facilities in the southeast region . (QU) (BU) (LO) Quantity Locatedin Apple had approximately 154,000 full-time equivalent employees , among which 80,000 in the US . (CO) (QU) (BU) (QU) (GE) Constituentof Quantity Quantity Locatedin 2. BusinessUnit as Segment: A segment of a company that operates as a distinct business unit and is reported separately in the company’s financial statements, with its own financial metrics such as revenue, operating income, and net income, should be labeled as a BusinessUnit entity. Examples of segments include “wholesale segment” and “television segment”. Gross profit for our international wholesale segment decreased $ 29.7 million for 2012 . (FE) (BU) (MO) (DA) Propertyof ValueChangeDecreaseby Valuein 144 3. BusinessUnit vs Location: When a department/segment of a Company is referred to by its location then it is to be annotated as a BusinessUnit entity. Note in the sentence below, “Asia Pacific” is annotated as a BusinessUnit entity. Mr. Harris will be appointed as head of Asia Pacific at Johnson Controls (PE) (DE) (BU) (CO) Designation Employeeof Constituentof 4. BusinessUnit vs Sector : When referring to a business/market of a Company, the context should determine if it is referred to as department inside the Company (and therefore annotated as a BusinessUnit) or it is being referred to as the business in which the Company operates in (and therefore annotated as a Sector entity). Note in the first example below how “naval business” is being sold, indicating it is a division of the company and therefore labeled as a BusinessUnit. In the second and third example, on the other hand, the phrases are describing the industry of the company and therefore labeled as a Sector entity. We entered into an agreement to sell Wallop ’s naval business . (AC) (CO) (BU) Constituentof Actionto Coric recently started expanding into auto business industry . (CO) (SE) Sector Ingles Markets, Incorporated (ingles or the Company) is a leading supermarket chain in the southeast United States . (CO) (SE) (LO) Sector 5. BusinessUnit vs Product: A program/initiative/plan of a Company is to be always annotated as a BusinessUnit entity, even if it is named after a Product of the Company. 145 Cardium Therapeutics to advance the Generx program . (CO) (BU) Constituentof Entity 3: Company (CO) 1. A brand, and even a trademark, of a Company is considered a separate Company and is annotated as such. A Subsidiaryof relation will exist between the brand and the parent Company. Rebel Foods operates 9 sub - brands including Faasos and Behrouz Biryani . (CO) (QU) (CO) (CO) Subsidiaryof Subsidiaryof 2. Overlap Company & GeopoliticalEntity/Location: When a country/location is part of a Company name, e.g. “Intel Germany”, annotate the phrase including the country/location as a Company entity. No GeopoliticalEntity/Location should be annotated and thus no Locatedin relation exists. Note in the first sentence below where this pattern occurs. 
Conversely, note that in the second sentence how “Finnish” is not part of the Company name so it is annotated as a separate GeopoliticalEntity and a Locatedin relation exists. Rallis India Ltd , a subsidiary of Tata Chemicals , on Wednesday reported a 13.3 per cent fall in its consolidated net profits . (CO) (CO) (DA) (QU) (FE) Subsidiaryof ValueChangeDecreaseby Propertyof Finnish Aldata Solution has signed a contract of supply its G.O.L.D. system to two French retail chains . (GE) (CO) (PR) (QU) (GE) (SE) Productof Locatedin 3. Overlap Sector & Company: Refer to item 5 in entity box 11 (Product) 146 Entity 4: Date (DA) 1. Time-related prefixes/suffixes such as “years ended, fiscal year, last quarter, beginning of” should be included in the Date entity. The segment ’s net sales for the fiscal years ended January 31 , 2010 , 2009 and 2008 , were $ 100B , $ 98B and $ 90B , respectively . (FE) (DA) (DA) (DA) (MO) (MO) (MO) Value Value Value Valuein Valuein Valuein 2. Time duration is not a Date entity. Boeing ’s lease agreement has a period of 5 months . (CO) Mr. Ekerman has been with Rasin for 10 years . (PE) (CO) Employeeof 3. “year-over-year”, “y-o-y”, “yearly” are not Date entities. Core net income (non - GAAP) decreased 45.7 % year - over - year to RMB50.2 million ( US $ 7.8 million ) . (FE) (QU) (MO) ValueChangeDecreaseby Value 4. Implicit/reference dates (dates that do not point directly to a point in time) are considered Date entities, e.g. “to this date”, “prior year”, “as of today”. 15,000 shares were distributed on the date of grant . (QU) (FE) (DA) Quantity Valuein Entity 5: Designation (DE) 1. Designation can be a generic role such as “consumer”, “third party” or “customer”. 147 NJRES markets and sells natural gas to wholesale customers . (CO) (PR) (DE) Productof 2. In addition to a Person, a Company can also take a Designation, e.g. the role of a Company in a transaction. The lawsuit alleges that SunEdison , as the Company’s controlling stakeholder breached its fiduciary duties to the minority stakeholders . (CO) (DE) (DE) Designation 3. Overlap Designation vs BusinessUnit: The job title of an employee can sometimes contain the name of a BusinessUnit, e.g. “head of Finance”. In this case, the Designation and BusinessUnit entities should be annotated separately. David M. Carroll is Senior EVP and Head of Capital Management Group of Wachovia Corporation . (PE) (DE) (DE) (BU) (CO) Designation Designation Employeeof Constituentof 4. Any job description, even outside the financial context, is a Designation entity. Adobe LiveMotiona software tool that allows professional designers to create two - dimensional Web animations . (CO) (PR) (DE) Productof Entity 6: FinancialEntity (FE) 1. Financial entities are entities that can take a monetary value. As such, financial statements and reports such as consolidated financial statement, 10-K and 10-Q are not a FinancialEntity, as they can’t take a monetary value. Notice in the below example how the “form 10-K” is actually part of a Date entity and is not a FinancialEntity by itself. 148 The dividend declared on June 14 , 2016 will be paid after the filing date of the 2016 form 10 - K and was included in other current liabilities . (FE) (DA) (DA) (FE) Propertyof 2. When to separate FinancialEntity: Financial entities occur ubiquitously in the dataset as expected, and as such it is particularly challenging to decide when to label a phrase an one FinancialEntity versus when to separate into individual ones. 
The general rule will be to decide based on the context: if the phrase is describing one concept as a whole or if sufficiently distinct entities can be understood. Given below is an example of each case. Note that although the phrase in the first example is shorter than the second one, it is still separated into multiple entities, i.e. length alone is not an indicator. Interest on loans and revenues totaled $ 5 million . (FE) (FE) (FE) (MO) Propertyof Propertyof Value Brion Inc. repurchased 3.9 million shares of the company’s common stock at an average price of $75.29 per share . (CO) (AC) (QU) (FE) (FE) ActionBuy Propertyof Quantity Propertyof Value 3. General “Price” of a FinancialEntity: The value of a FinancialEntity is sometimes expressed through a general price term, e.g. “average total price”, “total lump value”, etc. In this case, the general price term and the FinancialEntity are lumped together into one entity if they occur contiguously in text, e.g. “average stock price”, otherwise two separate FinancialEntities are annotated, e.g. “the total price of the Company’s ownership interest”. The two entities are then linked through a Propertyof relation. 149 The origination fee increased by 7000 shares for an aggregate adjusted total of $4 million . (FE) (QU) (FE) (FE) (MO) Propertyof Quantity Value ValueChangeIncreaseby 4. Overlap FinancialEntity & Product: In some cases, the amount of a product sold is given a value, e.g. “$4 million in GPUs sold”. In this case, two overlapping entities need to be labeled, the Product entity (“GPUs”) and the FinancialEntity (“GPUs sold”). The increase in net sales resulted from a 9.2 % increase in the pounds of poultry products sold . (FE) (QU) (FE) (PR) ValueChangeIncreaseby 5. Overlap case FinancialEntity and GeopoliticalEntity/Location: Refer to item 2 in Entity box 7 (GeopoliticalEntity) Entity 7: GeopoliticalEntity (GE) 1. Regions of a country are a Location entity not a GeopoliticalEntity, e.g. “southeast United States” Ingles Markets, Incorporated (ingles or the Company) is a leading supermarket chain in the southeast United States . (CO) (SE) (LO) Sector 2. Overlap case GeopoliticalEntity/Location and FinancialEntity: Countries/Locations when used as a modifier to a FinancialEntity, e.g. “French loans”, are not to be annotated separately. Instead, the whole phrase is annotated as one FinancialEntity and no Location or GeopoliticalEntity is added. From the recovery of Japanese taxes paid in excess of the U.S. tax rate . (FE) (FE) 3. Overlap GeopoliticalEntity vs Company: Refer to item 2 in Entity box 3 (Company) 150 Entity 8: Location (LO) 1. Streets names and numbers are Location entities. Mr. Mathew Wolsfeld , Corporate Secretary , NTI Corporation , 4201 Woodland Road , P.O. Box 69 , Circle Pines , Minnesota . (PE) (DE) (CO) (LO) (LO) (LO) Employeeof Designation Locatedin Locatedin 2. Generic locations such as “globally”, “internationally”, “around the world”, etc. are not Location entities. PG currently sells Crest Whitestrips in North America and internationally . (CO) (PR) (LO) Productof 3. Regions of a country are a Location entity, e.g “southeast United States”. Refer to item 1 in Entity box 7 (GeopoliticalEntity) 4. Location vs BusinessUnit: Refer to item 3 in Entity box 2 (BusinessUnit) 5. Overlap case Location and Company: Refer to item 2 in Entity box 3 (Company) 6. Overlap case GeopoliticalEntity/Location and FinancialEntity: Refer to item 2 in Entity box 7 (GeopoliticalEntity) Entity 9: Money (MO) 1. 
Modifiers/Prepositions: Include modifiers/prepositions in the Money entity which are part of the actual value such as a currency e.g. “$5 per share” or “10 million Japanese Yen”. In fiscal 2015 we paid dividends totaling $0.30 per share . (DA) (FE) (MO) Value Valuein 151 2. Inexact values: When the value of a Money or Quantity entity is not exactly mentioned, e.g. “over 20%”, “under $1 million”, “maximum of 20 crore” , etc. then include the modifier in the entity phrase. Do not add meaningless modifiers such as “approximately”, “almost”, etc. Revenue in international markets was less than 10% in the last quarter . (FE) (QU) (DA) Value Valuein 3. Two currencies: When a Money entity is listed under two different currencies, group the two currencies into one Money entity if they are contiguous in text, otherwise annotate separately. Core net income ( non - GAAP ) decreased 45.7 % year - over - year to RMB50.2 million ( US $ 7.8 million ) . (FE) (QU) (MO) ValueChangeDecreaseby Value Entity 10: Person (PE) 1. Only label named persons as a Person entity. Pronouns such as “he”,“she”,“they”, etc. should not be labeled. 2. Include a person’s title such as “Ms.”, “Mr.”, “Dr.”, etc. in the entity phrase. Ms. Hogan was appointed Executive Vice President , Human Resources in November 2014 (PE) (DE) (BU) (DA) Designation 152 Entity 11: Product (PR) Since Product and Sector entities are interrelated, the first two points help explain the distinction between the two entities. 1. The Product entity can be easily confused with the Sector entity. Some cases are easy to spot; Product tends to be something more specific, like “iPhone”, “2080Ti GPU”, while a Sector infers a more general area of products, like “Tech”, “Healthcare”. Between these two extremes, a lot of other examples tend to be much more confusing, such 153 as “semiconductor chips”, “computer services” which can reasonably be understood as being both a Product and a Sector. 2. Monetary value: To resolve the confusion, the rule will be as follows: any physical object or a service that can be sold/provided for a monetary value is considered a Product entity. Then, our previous examples of “semiconductor chips” and “computer services” should be labeled as Product entities. The exception is when a product is explicitly mentioned as a sector of a company, usually using words such as “business”, “firm”, “startup”, etc. e.g “Apple is in the phone business”, or using a word as an adjective of the company e.g. , “Intel bought the GPU startup Graphistry”. In these cases, the entities are labeled as Sector because they are explicitly referred to as a sector of a company. Had these same exact entities occur in a different context, such as in “Apple sells mobile phones”, they should be labeled as Product. In August 2016 , Intel purchased deep - learning startup Nervana Systems for over $ 400 million . (DA) (CO) (AC) (SE) (CO) (MO) Propertyof Actionin ActionBuy Valuein Sector Value 3. Services provided by a company, e.g. “cloud services”, “medical services” are a Product entity. The exception is again when the entity is explicitly stated as a Sector or is used as a Company descriptor, e.g. “the software services firm Gile Inc.” Enzo Clinical Labs is a clinical reference laboratory providing a wide range of clinical services to physicians and pharmaceutical companies . (CO) (SE) (PR) (DE) (SE) Sector Productof 4. Products: When the word “products” occurs next to an actual Product entity, add it to the entity, e.g. 
“poultry products” 154 Gross margin increased $ 4.8 billion or 22 % , driven by growth in server products and cloud services revenue . (FE) (MO) (QU) (PR) (PR) (FE) ValueChangeIncreaseby Propertyof Propertyof ValueChangeIncreaseby 5. Generic vs Named Products: When generic products/objects are listed in sentence without reference as being a Product to some Company, they should not be annotated as a Product entity. Compare “Company A manufactures cars and trucks.” versus “Company B sells certain automotive parts for use in cars and trucks.” In the first example, “cars” and “trucks” are Products of the Company and should be labeled as such, whereas in the second example, their mention does not warrant a Product entity label. Products which are named (not generic), e.g. “windows 7”, are labeled as a Product entity even when the manufacturing company is not mentioned in the context. Energy Services sells LNG to customers for use by trucks, drilling rigs and other motor vehicles. (CO) (PR) (DE) Productof Intel sells modem chips to be used in iphone and ipad , among others . (CO) (PR) (PR) (PR) Productof 6. Overlap case, Sector & Product: Always try to separate between Product and Sector entities, e.g. “manufacturer of eyeglasses”, label “eyeglasses” as a Product and “manufacturer” as a Sector. Amphenol , an electronic and fiber optic equipment manufacturer , is acquiring sensor manufacturer MTS Systems for $ 1.7 billion . (CO) (PR) (SE) (AC) (PR) (SE) (CO) (MO) Productof Sector ActionBuy Propertyof Sector Productof Value 7. Product and brands of Company: In a sentence where both a brand Company and its parent Company are mentioned, and where a Product entity of the brand Company exist, only annotate one Product relation between the Product 155 and the brand Company (i.e. Do not add a Productof relation between the Product and the parent Company of the brand). In the case where the brand Company name is not explicitly mentioned, add a Productof relation with the parent Company. Note in the first sentence below how the Product is associated with the brand name, while in the second sentence it is associated with the parent Company “CooperVision” since the brand name is not explicitly mentioned. CooperVision markets single - use silicone hydrogel spherical , toric and multifocal lenses under the clariti 1day brand . (CO) (PR) (CO) Subsidiaryof Productof CooperVision markets single - use silicone hydrogel spherical , toric and multifocal lenses under multiple different brands . (CO) (PR) Productof 8. Product vs BusinessUnit: Refer to item 5 in Entity box 2 (BusinessUnit) 9. Overlap Product and FinancialEntity: Refer to item 4 in Entity box 6 (FinancialEntity) Entity 12: Quantity (QU) 1. Inexact values: Refer to item 2 in Entity box 9 (Money) 2. Include additional unit descriptors which are not a different entities in themselves, e.g. “100 cm”, “2000 tons”, “500 units”. Unlike in the examples “200 shares” or “500 employees” where “shares” is a FinancialEntity and “employees” is a BusinessUnit entity. The construction of three ferryboats with an option for two more units , further confirmed at a total value of 500 mln euro ($ 645.8 mln) . (QU) (PR) (QU) (FE) (MO) Quantity Value Entity 13: Sector (SE) 156 1. Check first two points in Entity box 11 (Product) for details to clarify confusion between the Sector entity and the Product entity. 2. 
Common pattern: A very common pattern is as follows: “The company is a [supplier/manufacturer/seller/designer] of [general Product] such as [Specific Product 1], [Specific Product 2] and [Specific Product 3].” In this case, the [general Product] and all [specific products] are labeled as Product entities, while the Company description [supplier/manufacturer/seller/designer] is not always a Sector entity. Meaningful descriptions such as “manufacturer/designer/developer” are examples of a Sector while more general roles such as “buyer/seller” are not Sector entities. Refer to Entity box 5 (Designation) for more details. TSMC is a manufacturer of semiconductor devices such as CPUs , GPUs and microprocessors . (CO) (SE) (PR) (PR) (PR) (PR) Sector Productof Productof Productof Productof 3. Always add modifiers/descriptors to Sector entity, such as “business, segment, firm, unit, startup, company or sector”. Wipro to acquire US cybersecurity firm Edgile for $ 230 million (CO) (AC) (GE) (SE) (CO) (MO) Propertyof ActionBuy Sector Locatedin Value Mobil had 30 % stake in fuels and 51 % stake in lubricants businesses . (CO) (QU) (FE) (SE) (QU) (FE) (SE) Value Value Propertyof Propertyof Sector Sector 157 4. International and Private/Public Sectors: Terms like International, Multinational, or Local are not considered part of the ’Sector’ entity unless they are integral to the business’s official description, such as “regional food supplier”. The same rule applies to the designations Private or Public. For instance, in the phrase “public accounting firm” the term ’public’ would be considered part of the ’Sector’ entity as it is integral to the description of the business. It denotes a specific type of accounting firm that deals primarily with publicly-traded companies. On the other hand, in the phrase “public university”, ’public’ doesn’t describe the type of service provided or the sector the entity belongs to, but rather the funding source of the institution. Cardinal Health , Inc. is an American multinational health care services company . (CO) (GE) (SE) Sector Locatedin 5. Overlap case, Company & Sector : If the sector is part of the company name, do not add a separate Sector entity, e.g “Enzo Pharmaceuticals”, do not label “Pharmaceuticals” as a Sector entity, it is part of the company name. Samsung Electronics produces LCD and LED panels , mobile phones and memory chips . (CO) (SE) (PR) (PR) (PR) Sector Productof Productof Productof 6. Overlap Sector vs Product: Refer to item 6 in Entity box 11 (Product) 7. Sector vs Designation: Refer to Entity box 5 (Designation) 8. Sector vs Action: Refer to Entity box 1 (Action) 9. Sector vs BusinessUnit: Refer to item 4 in Entity box 2 (BusinessUnit) 158 C.5 Detailed Relation Annotation Rules 159 Relation 1: ActionBuy 1. Majority vs Minority (and Intermediate FinancialEntity): A very common pattern is as follows: “Company A acquired [shares/equity/stake] of Company B”. In this case, the ActionBuy relation is to be annotated between “Company A” and “Company B” if it acquired at least a majority stake of “Company B” (or all assets). Otherwise, ActionBuy relation is between “Company A” and FinancialEntity. Note that in both cases, an additional Propertyof relation is needed between the seller Company, i.e. “Company B”, and the intermediate FinancialEntity as well as between the Action entity and the intermediate FinancialEntity. Facebook to acquire Instagram for $ 1 billion . (CO) (AC) (CO) (MO) Propertyof ActionBuy Value Effective August 1, 2016 , Zilton Inc. 
acquired some assets of Stern Agee LLC . (DA) (CO) (AC) (FE) (CO) Actionin Propertyof ActionBuy Propertyof 2. Transferring Ownership: Subsidiaryof and ActionBuy/ActionSell: See item 2 in relation box 11 3. Merger: A “merger” between two companies does not directly imply an ActionBuy or ActionSell relation, unless explicitly stated which Company acquired the other Company. “Merger” is always linked to both Companies involved using the Propertyof relation. Notice in the second sentence below how “Symetra Financial” has an ActionBuy relation since it was clarified that “Symetra Financial” is the Company acquiring the other Company. Also note in the two sentences below how the word “agreement” is always added to the Action entity, “merger agreement” On January 16 2007 , Zilton Inc. entered into a merger agreement with CNS Response . (DA) (CO) (AC) (CO) Actionin Propertyof Propertyof 160 Symetra Financial announced the merger agreement with Sumitomo Life under which it will acquire all the outstanding shares of Sumitomo . (CO) (AC) (CO) (AC) (FE) (CO) ActionBuy Propertyof Propertyof Propertyof Propertyof Relation 2: Actionin 1. Plan date vs Action date: Announcing plans to an Action is treated the same as completing the Action, i.e. an Actionin relation is added. On March 1 , 2010 , GE announced plans to sell its 20.85 % stake in Turkey - based Garanti Bank . (DA) (CO) (AC) (QU) (FE) (GE) (CO) Locatedin Valuein Value ActionSell Actionin Propertyof Propertyof Relation 3: ActionSell 1. Same rules apply to ActionBuy relation, check relation box 1 2. Transferring Ownership: Subsidiaryof and ActionBuy/ActionSell: See item 2 in relation box 11 3. Some more examples are shown below. Effective August 1, 2016 , Zilton Inc. acquired a minority stake in Stern Agee LLC from Clater . (DA) (CO) (AC) (FE) (CO) (CO) Actionin Propertyof ActionBuy ActionSell Propertyof Indiabulls Housing Finance promoter Sameer Gehlaut sells 12 % stake in company . (CO) (DE) (PE) (AC) (QU) (FE) Employeeof Designation ActionSell Propertyof Value 161 In March 2011 , GE announced that it had completed the acquisition of privately held Lineage Power Holdings from The Gores Group . (DA) (CO) (AC) (CO) (CO) Actionin Propertyof ActionBuy ActionSell Relation 4: Designation 1. Note: Do not confuse the Designation entity (see 5) with this Designation relation. 2. All tenses: past, present and future Designations are linked to the Person or Company through the Designation relation. Mr. Rubin previously served as a director of Dreams Inc. . Today he serves as the interim CEO at our company . (PE) (DE) (CO) (DE) Designation Employeeof Designation 3. Company & Designation: Companies can also take the Designation relation, usaully describing the role in a financial transcation. Indenture between Comcast Corporation , the subsidiary guarantor party thereto , and The Bank of New York Mellon , as trustee . (CO) (DE) (CO) (DE) Designation Designation Relation 5: Employeeof 1. The Employeeof relation can be stated explicitly through a Designation entity, e.g. “CEO”,“president” or can be implied, e.g. “Tim Cook of Apple”. Albert Bourla of Pfizer on Monday said two doses of the company’s vaccine may not provide protection against infection from omicron variant . (PE) (CO) (DA) (PR) Employeeof Productof 162 Shares of Uber jumped Tuesday after CEO Dara Khosrowshahi said the rideshare company had its best week ever in terms of overall bookings . (FE) (CO) (DA) (DE) (PE) (SE) (FE) Propertyof Employeeof Sector Designation 2. 
All tenses: past, present and future employments are linked to the Person through the Employeeof relation. Mr. Rubin previously served as a director of Dreams Inc. . Today he serves as the interim CEO at OPKU . (PE) (DE) (CO) (DE) (CO) Designation Employeeof Designation Employeeof 3. A Person simply owning shares of a Company is not an employee of the company. During Falcon ’s latest shareholder meeting , Jack Peterson suggested moving the election date of the Board . (CO) (DE) (PE) (DA) (DE) Designation Relation 6: Locatedin 1. Company and Locatedin: A Company has a Locatedin relation with a Location or GeopoliticalEntity only if it is based in that location, i.e. it is headquartered there. Having branches, selling products or generating revenue in a location does not warrant a Locatedin relation. Cision says the sale will return its U.K. operation to profitability . (CO) (AC) (GE) Automation Anywhere was founded in 2003 in San Jose , US . (CO) (AC) (DA) (LO) (GE) Locatedin Locatedin Propertyof Actionin 2. Nested locations: “A− >B− >C” pattern, label “A” Locatedin “B” and “B” Locatedin “C”. A very common pattern is the following: “The American Company is headquartered in city, state”. In this case, four Locatedin 163 relations need to be added: [company–Locatedin–American], [company–Locatedin–city], [city–Locatedin–state], [state–Locatedin–American] Broadcom Inc. is an American developer , manufacturer and global supplier of semiconductor products headquartered in San Jose , California . (CO) (GE) (SE) (SE) (SE) (PR) (LO) (LO) Locatedin Sector Sector Sector Productof Locatedin Locatedin Locatedin 3. BusinessUnit and Locatedin: A BusinessUnit, e.g. branch/factory/plant/facility, can take a Locatedin relation. Ingles Markets owns multiple Supermarket stores in the southeast U.S. . (CO) (BU) (LO) Propertyof Locatedin 4. Quantity/Money and Locatedin: Quantity and Money entities can take a Locatedin relation. Example: A Quantity could be the number of products sold in a Location while Money could be the revenue generated in a Location. We had approximately 408,000 employees as of December31 , 2009 , of which 340,000 are in the U.S. and 68,000 are located internationally . (QU) (BU) (DA) (QU) (GE) (QU) Quantity Valuein Valuein Locatedin Valuein Quantity Quantity Industry sales in the European Union are forecast to decline 5% in 2017 , while south American sales increase 15% from 2016 levels. (FE) (GE) (QU) (DA) (LO) (FE) (QU) (DA) Locatedin ValueChangeDecreaseby Valuein ValueChangeIncreaseby Valuein Locatedin 164 Relation 7: Productof 1. Any Product that is manufactured/marketed/sold by a Company is a Product of that Company. If multiple Companies are involved in the process, e.g. one Company manufactures the Product while another Company markets it, then the Productof relation is added to both Companies. 2. Product of Subsidiaries/Brands: If the Product of a subsidiary company is mentioned, do not add a Productof relation for the parent Company, this should be deduced from the Subsidiaryof relation. Similarly, in cases where a product of “Company A” is sold/marketed under its brand name “Company B”, add the Productof relation only for the subsidiary “Company B”. Notice in the example below how the Product “earbuds” is annotated as the Productof the brand names, not the Company “Harman”. In 2017 , Samsung acquired Harman International . Harman makes earbuds under many brand names such as AKG , AMX and Becker . 
(DA) (CO) (AC) (CO) (CO) (PR) (CO) (CO) (CO) Actionin Propertyof ActionBuy Subsidiaryof Subsidiaryof Subsidiaryof Productof Productof Productof CooperVision markets single - use silicon hydrogel spherical , toric and multifocal lenses under our clariti 1day brand . (CO) (PR) (CO) Productof Subsidiaryof Relation 8: Propertyof 1. Action and Company/Person: An Action entity is always linked to the Company receiving the Action through a Propertyof relation. In the case of a “merger”, the “merger” Action is linked to both Companies involved using the Propertyof relation. 165 Facebook to acquire Instagram for $ 1 billion . (CO) (AC) (CO) (MO) Propertyof ActionBuy Value Effective August 1, 2016 , Zilton Inc. acquired some assets of Stern Agee LLC . (DA) (CO) (AC) (FE) (CO) Actionin Propertyof ActionBuy Propertyof On January 16 2007 , Zilton Inc. entered into a merger agreement with CNS Response . (DA) (CO) (AC) (CO) Actionin Propertyof Propertyof 2. BusinessUnit/FinancialEntity and Company: A BusinessUnit is always linked to its owner Company through the Propertyof relation. One exception is mentioned in the next point (Transferring ownership). Similarly, a FinancialEntity is linked to its corresponding Company. Common examples include: “stock/share” of a Company, “revenue” of a Company and ”ownership interest” of a Company. Tesla shares dropped %4 over the weekend . (CO) (FE) (QU) (DA) Propertyof ValueChangeDecreaseby Valuein 3. Transferring Ownership: BusinessUnit and Company: A BusinessUnit is usually linked to a Company through the Propertyof relation. The exception is when the BusinessUnit is being sold by its parent Company. In this case, an ActionSell relation should be labeled between the BusinessUnit and the parent Company. No Propertyof relation is added, neither to the buying Company nor to the selling Company. 166 Intel completed the acquisition of the Wireless Solutions (WLS) division of Infineon early in 2011 for US $1.4 billion . (CO) (AC) (BU) (CO) (DA) (MO) Propertyof ActionBuy ActionSell Actionin Value Valuein 4. Breaking up a long FinancialEntity: Following rule 5 from the genral rules (Elliptical pattern, conjunctions and prepositions): It is very common to encounter a very long sequence of a FinancialEntity. If the annotator can reasonably spot two separate FinancialEntities in this long phrase, the phrase can be broken up into two pieces and linked together through a Propertyof relation. Typical keywords that help spot this divide are “entity1 is related to entity2”, “entity1 is a part of entity2” and so on. Below we show two examples. In the first example, the FinancialEntity cannot be broken because the whole phrase is describing one concept. On the other hand, the second example clearly distinguishes between the two entities, by describing one being part of the other. In this case, the phrase is broken into two FinancialEntities that are linked together using a Propertyof relation. As of October 2017 , we recorded a non-current deferred tax liability (net of valuation allowance) , of $5.5 million . (DA) (FE) (MO) Value Valuein As of October 2017 , we recorded a non-current deferred tax liability as part of the net of valuation allowance , of $5.5 million . (DA) (FE) (FE) (MO) Propertyof Value Valuein 5. FinancialEntity part of another FinancialEntity: A very common pattern is as follows: “[FinancialEntity1] comprised x% of [FinancialEntity2]”. 
In this case two relations are to be annotated: “[FinancialEntity1]” has a Value relation of “x%” and “[FinancialEntity1]” is Propertyof “[FinancialEntity2]” 167 Marketing expenses were 30% of our overall expenses in 2017 . (FE) (QU) (FE) (DA) Value Propertyof Valuein 6. FinancialEntity and Product: A very common pattern is to describe the [sales/revenue/etc.] from a Product. In this case, the FinancialEntity is linked to the Product though a Propertyof relation. Sales of Electra 350 dipped 7.32 percent to 4,521 units in the past month . (FE) (PR) (QU) (QU) (DA) ValueChangeDecreaseby Quantity Propertyof Valuein Valuein 7. General “Price” of a FinancialEntity: Refer to item 3 in Entity box 6 (FinancialEntity) Relation 9: Quantity 1. Do not confuse the Quantity entity (a countable or non-countable numeric) with this Quantity relation (only countable quantity of an entity) 2. Countable vs non-countable Quantity entities: A Quantity entity can be a countable value (e.g. number of shares) or a non-countable value (e.g. percentage, interest rate). Countable entities take the Quantity relation while non-countable entities take the Value relation. Notice in the first example below, despite “maximum of 2.5%” being a Quantity entity, it is used in a Value relation, not a Quantity relation. The other two examples showcase typical usage of the Quantity relation. The Fed said on Wednesday it enacted its second consecutive 0.75 percentage point interest rate increase , bringing it to a maximum of 2.5 % . (DA) (QU) (FE) (QU) Valuein ValueChangeIncreaseby Value Valuein 168 Tasla Motors global wholesales sees 2 % rise to 285,445 units in Q3 . (CO) (FE) (QU) (QU) (DA) ValueChangeIncreaseby Propertyof Quantity Valuein Valuein We had approximately 408,000 employees as of December 31 , 2009 , of which 340,000 are in the U.S. and 68,000 are located internationally . (QU) (BU) (DA) (QU) (GE) (QU) Valuein Locatedin Valuein Valuein Quantity Quantity Quantity Relation 10: Sector 1. Sector of subsidiaries: If the Sector of a subsidiary is mentioned, do not add a Sector relation for the parent Company. This would be deduced from the Subsidiaryof relation. Elcoteq Dongguan is one of Elcoteq ’s four volume manufacturers in the Asia - Pacific region . (CO) (CO) (QU) (SE) (LO) Subsidiaryof Locatedin Sector Locatedin Relation 11: Subsidiaryof 1. Brands of a company should be labeled as separate companies and then the Subsidiaryof relation is added to the parent company The Volkswagen Group sells passenger cars under the Audi , Bentley and Bugatti among other marques . (CO) (PR) (CO) (CO) (CO) Subsidiaryof Subsidiaryof Subsidiaryof Productof Productof Productof 169 2. Transferring Ownership: Subsidiaryof and ActionBuy/ActionSell: A Company is usually linked to its parent Company through the Subsidiaryof relation. The exception is when the Company is being sold by its parent Company. In this case, an ActionSell relation should be labeled between the Company and the parent Company. No Subsidiaryof relation is added, neither to the buying Company nor to the selling Company. In March 2011 , GE announced that it had completed the acquisition of privately held Lineage Power Holdings from The Gores Group . (DA) (CO) (AC) (CO) (CO) Actionin Propertyof ActionBuy ActionSell Relation 12: Value 1. Value and ActionBuy/ActionSell: When an ActionBuy/ActionSell relation is present, and the monetary value of the sale is present as well, the head entity of the Value relation should be the Company or FinancialEntity being bought/sold. 
Office Depot Sells CompuCom in $ 305 M Deal . (CO) (AC) (CO) (MO) ActionSell Propertyof Value 2. Intermediate FinancialEntity: A very common pattern is as follows: “Company A acquired [shares/equity/stake] of Company B for Money”. In the case where this does not indicate that a Company acquired ther other Company (got at least a majority stake) then the Value relation is between the FinancialEntity ([shares/equity/stake]) and the Money entity. Zilton Inc. acquired all of the equity interests of Stern Agee LLC for $3 million . (CO) (AC) (FE) (CO) (MO) Propertyof Value ActionBuy Propertyof 170 In January 2015 , Intel purchased a 30 % stake in Vuzix , a smart glasses manufacturer . The deal was worth $ 24.8 million . (DA) (CO) (AC) (QU) (FE) (CO) (PR) (SE) (MO) Sector Productof Valuein ActionBuy Propertyof Propertyof Value Valuein Value 3. Countable vs non-countable Quantity entities: A Quantity entity can be a countable value (e.g. number of shares) or a non-countable value (e.g. percentage, interest rate). Countable entities take the Quantity relation while non-countable entities take the Value relation. Refer to item 2 in relation box 9 (Quantity) for more examples. The Fed said on Wednesday it enacted its second consecutive 0.75 percentage point interest rate increase , bringing it to a maximum of 2.5 % . (DA) (QU) (FE) (QU) Valuein ValueChangeIncreaseby Value Valuein 4. General “Price” of a FinancialEntity: “The average price of the Company’s’ shares is $5” then “average price” takes a Value relation of “$5” and “average price” takes a Propertyof relation with “shares”. For more details, refer to item 3 in entity box 6 (FinancialEntity). Relation 13: ValueChangeDecreaseby 1. Money entities , as well as Quantity entities (whether countable or non-countable), can take a ValueChangeIncreaseby or a ValueChangeDecreaseby relation. Sales of Electra 350 dipped 7.32 percent to 4,521 units in the past month . (FE) (PR) (QU) (QU) (DA) ValueChangeDecreaseby Value Propertyof Valuein Valuein 171 In Q3 , revenue rose 8 % from a year ago to $ 15.3 billion . (DA) (FE) (QU) (DA) (MO) ValueChangeIncreaseby Value Valuein Valuein Relation 14: ValueChangeIncreaseby 1. Check relation box 13 (ValueChangeDecreaseby) Relation 15: Valuein 1. Year comparisons: When a Money or a Quantity entity is being compared against an old Date, the Valuein relation doesn’t hold with the old Date, only with the current Date (if mentioned). Notice in the example below how the decrease of net earning is “36% compared to 2007” but not in “2007”. The date the decrease happened is not mentioned in the sentence so there is no Valuein relation. The same pattern occurs again in the second example with “2016”. We generated net earnings of $ 3.0 billion , representing a decrease of 36 % compared to 2007 . (FE) (MO) (QU) (DA) Value ValueChangeDecreaseby Industry sales in the European Union are forecast to decline 5% in 2017 , while south American sales increase 15% from 2016 levels. (FE) (GE) (QU) (DA) (LO) (FE) (QU) (DA) Locatedin Valuein Valuein Locatedin 2. Quantity entities (whether countable or non-countable), as Money entities, always take a Valuein relation if a Date is present. 172 We had approximately 408,000 employees as of December 31 , 2009 , of which 340,000 are in the U.S. and 68,000 are located internationally . 
(QU) (BU) (DA) (QU) (GE) (QU) Valuein Locatedin Valuein Valuein Quantity Quantity Quantity 173 C.6 General Rules Below we list some general rules and best practices which do not apply to a specific relation or entity class. We follow some of the rules listed in ACL RD-TEC Annotation Guideline 1. Determiners (Following the ACL RD-TEC Annotation Guideline) • Do not annotate determiners (e.g. “the”, “a”, “every”, etc.) or pronouns e.g. “the net income in ...”, unless it is part of the name, usually will be capitalized, e.g. “The Bank of New York” 2. Term-abbreviation sequence (Following the ACL RD-TEC Annotation Guideline) • In the case that a term is followed by its abbreviation, the whole sequence is annotated as one term, e.g. “University of Southern California (USC)” Biocon on Monday announced the merger of its material subsidiary Biocon Biologics Ltd (BBL) , with Covidshield Technologies (CTPL) . (CO) (DA) (AC) (SE) (CO) (CO) Subsidiaryof Sector Actionin Propertyof Propertyof 3. Maximal Length Annotation Principle (Following the ACL RD-TEC Annotation Guideline) • We will mostly follow a greedy approach, i.e., label everything to the right of the entity as long as it is still describing the same concept. • Generic nouns: Terms can be accompanied by generic nouns, e.g., the word “business” in the Sector entity “Tech business”. Although it can be argued that “business” is a generic noun and the entity is “Tech”, for a number of considerations annotators are asked to 174 mark “Tech business” as the entity. The added noun does not need to bring additional meaning or information to the entity. • Notice in the example below how “server products” as a phrase is labeled as a Product, although the term “products” doesn’t really add a new meaning to the entity. Gross margin increased $ 4.8 billion or 22 % , driven by growth in server products and cloud services revenue . (FE) (MO) (QU) (PR) (PR) (FE) ValueChangeIncreaseby Propertyof Propertyof ValueChangeIncreaseby 4. Adjectival modifiers (Following the ACL RD-TEC Annotation Guideline) • Contrary to the previous rule, adjective modifiers (modifiers to the left of an entity) are not always annotated as part of the entity. Only when an adjective adds meaning to the entity should it be labeled. • Positive Example: Take an example the phrase: “low-interest loan”. The FinancialEntity is the word “loan” but since “low-interest” adds meaning by specifying the type of “loan”, the whole phrase is labeled as the FinancialEntity. Another example is provided below. “Interim” adds meaning to the Designation entity “CEO”. Mr. Rubin serves as the interim CEO at OPKU . (PE) (DE) (CO) Designation Employeeof • Negative Example: Consider the example below and specifically the Sector entity. We know that the actual Sector of the Company is “public accounting”. According to rule 3 (Maximal Length Annotation Principle), the word “firm” should be added to the entity. Now regarding the adjective phrase “independent registered”, since this doesn’t add 175 meaning to the Sector of the Company, rather it gives a fact regarding the Company’s registration, it should not be labeled. Deloitte Touche LLP is our independent registered public accounting firm . (CO) (SE) Sector 5. Elliptical pattern, conjunctions and prepositions (Following the ACL RD-TEC Annotation Guideline) • For conjunctions, if the noun phrases linked by them are ellipses, the whole span should be annotated as one. 
For example, in “two-wheel drive, four-wheel drive and all-wheel drive cars”, where we can also read “two-wheel drive cars, four-wheel drive cars and all-wheel drive cars”, the whole term is annotated as one Product entity. Otherwise, split the string at the conjunction and annotate the conjuncts separately. Toyota sells two wheel drive , four wheel drive and all wheel drive cars . (CO) (PR) Productof Total account and service fees increased $62 million . (FE) (MO) ValueChangeIncreaseby • Complex phrases containing prepositions can normally be split at those points where the prepositions are placed. Thus, for the text snippet ”interest on loan payments and generated revenues”, “interest”, “loan payments” and “revenues” are annotated as separate terms. Interest on loan payments and generated revenues totaled $5 million . (FE) (FE) (FE) (MO) Propertyof Propertyof Value 176 As of October 2017 , we recorded a non-current deferred tax liability as part of the net of valuation allowance , of $5.5 million . (DA) (FE) (FE) (MO) Propertyof Value Valuein 6. Nested/Overlap Entities • There is no general rule to apply when dealing with entity overlap between different entity types. The annotator needs to decide on a case-by-case basis and depending on the type of entities involved. Generally speaking, it is encouraged to avoid overlapping entity labeling, either by annotating the whole phrase as one entity or separating into two entities then using the Propertyof relation to link them. Listed below are some the common patterns which already has been discussed in this document. • Overlap Company & GeopoliticalEntity/Location: Refer to item 2 in entity box 3 (Company) • Overlap FinancialEntity & Product: Refer to item 4 in entity box 6 (FinancialEntity) • Overlap FinancialEntity & GeopoliticalEntity/Location: Refer to item 2 in entity box 7 (GeopoliticalEntity) • Overlap Sector & Company: Refer to item 5 in entity box 11 (Product) • Overlap Sector & Product: Refer to item 6 in entity box 11 (Product) 7. Past Relations • Relations that occured in the past are to be annotated in the same way as relations that occured in the present. 177 Lauren Hall previously was the president and CEO of Malton Inc. from 2016 to 2017 . (PE) (DE) (DE) (CO) (DA) (DA) Designation Designation Employeeof Net cash used in financing activities decreased $ 54 million from $ 72.4 million . (FE) (MO) (MO) ValueChangeDecreaseby Value 8. Hypothetical Events/Negative Relations • If an event did not happen, i.e. the sentence implies the event was cancelled/declined or the event is hypothetical, i.e. might or might not happen, then the associated relations with this event should not be labeled. Note in the example below, there is no ActionBuy or Value relation. In the second example, notice that there is no ValueChangeIncreaseby/ValueChangeDecreaseby relation. Elon Musk to backtrack from his offer to buy Twitter for $ 44 billion . (PE) (AC) (CO) (MO) Propertyof Industry sales in the European Union are forecast to decline 5% in 2017 , while south American sales increase 15% from 2016 levels. (FE) (GE) (QU) (DA) (LO) (FE) (QU) (DA) Locatedin Valuein Valuein Locatedin • Note: A reported event or estimated amount is not treated as the other hypothetical events and should be labeled. As a rule of thumb, phrases of the form “If ... then ... ” are what we consider hypothetical events. 178 Apple is reported to acquire AI startup Vocum for $10 million . (CO) (AC) (SE) (CO) (MO) ActionBuy Propertyof Sector Value 9. 
Repeated Occurrence of Relations • When an entity is repeated in a sentence, all instances of this entity should be labeled and considered for any relation, even if the relation will be repeated as well. • Notice in the example below, even though “Lauren Hall” is already labeled as being an Employeeof “Malton Inc.”, the second sentence, “Malton Inc.” is mentioned again and another Designation of “Lauren Hall” is added at the Company. Therefore, the annotator has to add an additional Employeeof relation due to the information in the second sentence. Lauren Hall was the president and CEO of Malton Inc. from 2016 to 2017 . Before that, she was the promoter of Malton Inc. . (PE) (DE) (DE) (CO) (DA) (DA) (DE) (CO) Employeeof Employeeof Designation Designation Designation 179 Appendix D ToolCritic Prompt Templates and Training Details D.1 Synthetic Data Generation Prompt and Details To train ToolCritic, we created a dataset by synthetically inserting errors into error-free dialogues from the SGD dataset. Below, we detail the process, starting with an example of a input dialogue and then explaining how errors were introduced. D.1.1 Demonstration Example We start with the following error-free dialogue from the SGD dataset: # Turn 1 USER: "I need to find a seat on a bus." ASSISTANT: "Where are you leaving from? Where are you going?" # Turn 2 USER: "I am leaving from Vancouver to go to Seattle." ASSISTANT: "When are you leaving?" # Turn 3 USER: "I am leaving on the 12th of this month. I need 1 ticket." 180 ASSISTANT: - API CALL: FindBus(from_location=’Vancouver’, leaving_date=’2019-03-12’, to_location=’Seattle’, travelers=’1’) - Description: Find a bus journey for a given pair of cities - Required Arguments: * from_location: City where bus is leaving from; is_categorical: False * to_location: City where bus is going to; is_categorical: False * leaving_date: Date of bus leaving for journey; is_categorical: False - Optional Arguments: * travelers: Number of travelers for journey; is_categorical: True; Possible Values: [’1’, ’2’, ’3’, ’4’, ’5’] - RESULT: * {’fare’: ’29’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’06:40’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} * {’fare’: ’31’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’08:10’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} * {’fare’: ’26’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’08:30’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} 181 * {’fare’: ’30’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’08:50’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} - RESPONSE: "I found multiple options. First leaves at 6:40 am and is $29. There are 0 transfers." # Turn 4 USER: "Anything else available?" ASSISTANT: "Next, 8:10 am for $31. Also has 0 transfers." # Turn 5 USER: "What is the station I am leaving from? What station am I arriving at?" ASSISTANT: "You are leaving from Pacific Central Station and arriving at King Street Station." # Turn 6 USER: "Any other options?" ASSISTANT: "Next, the bus departs at 8:30 am and the ticket cost is $26. 
Again, 0 transfers." # Turn 7 USER: "That will work, please reserve." ASSISTANT: "Please confirm, leaving Vancouver for Seattle on March 12th at 8:30 am. 1 Ticket reserved." # Turn 8 182 USER: "Yes that is correct." ASSISTANT: - API CALL: BuyBusTicket(from_location=’Vancouver’, leaving_date=’2019-03-12’, leaving_time=’08:30’, to_location=’Seattle’, travelers=’1’) - Description: Buy tickets for a bus journey - Required Arguments: * from_location: City where bus is leaving from; is_categorical: False * to_location: City where bus is going to; is_categorical: False * leaving_date: Date of bus leaving for journey; is_categorical: False * leaving_time: Time of bus leaving for journey; is_categorical: False * travelers: Number of travelers for journey; is_categorical: True; Possible Values: [’1’, ’2’, ’3’, ’4’, ’5’] - Optional Arguments: - RESULT: {’fare’: ’26’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’08:30’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} - RESPONSE: "Your ticket is confirmed." # Turn 9 USER: "Thank you, that will be all." ASSISTANT: "Have a great day!" Listing D.1: SGD Dialogue Example Notice the format we use to display the conversation. We group each assistant and user message into a single turn. This is to train ToolCritic to always expect the last turn to be an assistant turn. Notice also that we include the function specification after each invocation in the middle of the 183 dialogue. This ensures that ToolCritic can focus on the tool called and can recall its specification when needed. D.1.2 Hand-Writing Demonstration Examples Next, we show how we manually insert an error into this error-free dialogue and provide a detailed description to help guide the LLM in the following few-shot generation process. For this example, we will insert a “premature invocation” mistake at Turn 2. Error Location: Turn 2 Error Insertion Steps: 1- At Turn 3, locate the API call FindBus(). According to its documentation, this API takes three required arguments [’from_location’, ’to_location’, ’leaving_date’]. 2- The assistant executed this API at Turn 3 after getting all the information (required arguments) from the user to look for bus options. 3- To simulate the premature-invocation error, we will invoke this API in an earlier turn before getting all the needed information (required arguments) from the user. 4- At Turn 2, the user has so far only given information about the ’from_location’ and ’ to_location’ arguments but still hasn’t provided information regarding their desired ’ leaving_date’. 5- Copy the API call and RESULT from Turn 3 to Turn 2. The assistant’s response can also be copied without change. Explanation (Reasoning): 184 The user wanted to look for bus options leaving from Vancouver to Seattle. The assistant invoked the FindBus() API. Looking at the documentation for this API, we see that it takes three required arguments [’from_location’, ’to_location’, ’leaving_date’]. At this point in the conversation, the user has not yet provided values for all the required arguments, specifically, the user hasn’t yet provided the desired ’ leaving_date’. Before searching for bus options using the FindBus() API, the assistant should have obtained more information from the user, specifically about the desired ’leaving_date’. Instead, the assistant invoked the FindBus() API prematurely with some hallucinated argument values, before getting all required info. 
Therefore, this is a premature-invocation error. Error Turn: # Turn 2 USER: "I am leaving from Vancouver to go to Seattle." ASSISTANT: - API CALL: FindBus(from_location=’Vancouver’, leaving_date=’2019-03-12’, to_location=’Seattle’, travelers=’1’) - Description: Find a bus journey for a given pair of cities - Required Arguments: * from_location: City where bus is leaving from; is_categorical: False * to_location: City where bus is going to; is_categorical: False * leaving_date: Date of bus leaving for journey; is_categorical: False - Optional Arguments: * travelers: Number of travelers for journey; is_categorical: True; Possible Values: [’1’, ’2’, ’3’, ’4’, ’5’] - RESULT: 185 * {’fare’: ’29’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’06:40’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} * {’fare’: ’31’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’08:10’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} * {’fare’: ’26’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time * {’fare’: ’26’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’08:30’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} * {’fare’: ’30’, ’from_location’: ’Vancouver’, ’from_station’: ’Pacific Central Station’, ’leaving_date’: ’2019-03-12’, ’leaving_time’: ’08:50’, ’ to_location’: ’Seattle’, ’to_station’: ’King Street Station’, ’transfers’: ’0’, ’travelers’: ’1’} - RESPONSE: "I found multiple options. First leaves at 6:40 am and is $29. There are 0 transfers." Listing D.2: Demonstration example using the premature-invocation error category. Note that while all the above hand-written “Error Insertion Steps” will only be used to few-shot the LLM for data generation, but will not be used when fine-tuning ToolCritic. ToolCritic will be fine-tuned to output the error type and the “reasoning thought” only. 186 D.1.3 Few-Shot Data Generation Prompt After creating multiple demonstration examples like the one above, we used a few-shot prompting approach to instruct an LLM (Claude 3.5 Sonnet) to insert new errors into additional error-free dialogues from the SGD dataset. Below is the system prompt and user prompt used for this task. System Prompt: - You are given a task-oriented dialogue <query> between a user ("USER") and an assistant ("ASSISTANT"). - Your task is to modify the <query> dialogue to simulate a certain type of error made by the assistant, as described in <error-description>, therefore producing a "corrupted " dialogue. - The dialogue format at each turn follows one of these two options: 1- If the assistant did not invoke an API during this turn, then the format is as follows: # Turn n # The index of the current turn in the conversation containing both USER and ASSISTANT messages. USER: # A user message ASSISTANT: # The assistant message 2- If the assistant invoked an API during this turn, then the format is as follows: # Turn n # The index of the current turn in the conversation containing both USER and ASSISTANT messages. USER: # A user message ASSISTANT: # The start of the assistant field - API CALL: # The API called by the assistant alongside the argument values. 
This function returns a RESULT field. - Description: # Description of the API 187 - Required Arguments: # List of required arguments of the API with their descriptions - Optional Arguments: # List of optional arguments of the API with their descriptions - RESULT: # The output of the API call. This field will contain a list of option results if the API call was a search/lookup query, or it will contain a dictionary containing the result of an action API such as booking/reservation. - RESPONSE: # The final assistant response after observing the output of the API CALL. - Format your output as a JSON object containing the following four fields: * Error Insertion Steps: Step-by-Step description of how you simulated this error in the dialogue and what changes were made and at which turns. * Error Location: The index of the turn where the error was introduced. * Explanation: A short paragraph describing how to spot the assistant error, from the perspective of someone reading the dialogue. The description should be step-bystep, in a Chain-of-Thought fashion. * Corrupted Dialogue: The updated turn at which the error was inserted. - You are given a few examples to guide you in <demonstrations>. You are also given a hint in <hint> to help you decide where and how to insert the error. - Important: - Since all the dialogue turns before the error turn will be kept unchanged, do not return those in the "Corrupted Dialogue" field, only return the turn at which the error was inserted. 188 - If there are multiple possible locations to insert an error, choose the one that best matches the error description and that follows the given demonstration examples. If all things are equal, insert the error in the later turn in the dialogue instead of the earlier turns. - Only return the JSON Object. Do not include any additional text. Listing D.3: LLM system prompt for synthetic data generation User Prompt: Following is the description of the error you are tasked with simulating <error-description> {} </error-description> You have the following demonstration examples to guide you <demonstrations> {} </demonstrations> Now, your task is to modify the following <query> dialogue to introduce the error described above in <error-description> <query> {} </query> You are given the following hint <hint> {} 189 </hint> JSON Output: Listing D.4: LLM user prompt for synthetic data generation Notice the <hint> used at the end of the user prompt. The purpose of this hint is to ensure diversity in data generation and avoid model biases. For instance, we observed in our initial experiments that Claude 3.5 Sonnet tends to insert more errors into earlier turns in the dialogues or into tool calls that appeared frequently in the few-shot examples. To mitigate this bias, the hint provides specific guidance, such as the location, which tool argument to focus on, and how to insert the error. The hint is generated by randomly sampling a viable error location from the list of turns and instructing the LLM to insert the error there. For example, if the error category is “required arguments,” the hint might look something like: "Focus on the tool call ReserveRoundTripFlights() at Turn 5 and on the argument departure date." This approach ensures a more diverse and representative error dataset. 
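To make the data-generation loop concrete, the sketch below shows how the prompts above might be assembled for one error-free dialogue and how the model's JSON output could be parsed into a training record. This is an illustrative sketch, not the exact pipeline code: the call_llm callable is an assumed wrapper around the generator model (Claude 3.5 Sonnet in our runs), sample_hint is a simplified stand-in for the hint-sampling logic described above, and the turn dictionaries are an assumed internal representation of the dialogue.

```python
import json
import random

# SYSTEM_PROMPT and USER_PROMPT_TEMPLATE hold the texts of Listings D.3 and D.4;
# the user template keeps its four "{}" slots (error description, demonstrations,
# query dialogue, hint), in that order.
SYSTEM_PROMPT = "..."          # paste Listing D.3 here
USER_PROMPT_TEMPLATE = "..."   # paste Listing D.4 here

def sample_hint(dialogue_turns):
    """Randomly pick a viable error location so generations stay diverse.
    Each turn is assumed to be a dict like {"index": 3, "api_call": "FindBus"}."""
    viable = [t for t in dialogue_turns if t.get("api_call")]
    turn = random.choice(viable)
    return f"Focus on the tool call {turn['api_call']}() at Turn {turn['index']}."

def generate_corrupted_example(call_llm, error_category, error_description,
                               demonstrations, query_dialogue, dialogue_turns):
    """call_llm(system, user) -> str is an assumed wrapper around the generator
    model (Claude 3.5 Sonnet in our experiments)."""
    hint = sample_hint(dialogue_turns)
    user_prompt = USER_PROMPT_TEMPLATE.format(
        error_description, demonstrations, query_dialogue, hint)
    record = json.loads(call_llm(SYSTEM_PROMPT, user_prompt))
    # Only the error type and the reasoning ("Explanation") are later used as
    # fine-tuning targets; the "Error Insertion Steps" only guide generation.
    return {
        "error_category": error_category,
        "error_turn": record["Error Location"],
        "reasoning": record["Explanation"],
        "corrupted_turn": record["Corrupted Dialogue"],
    }
```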
D.2 Supervised Fine-Tuning Prompt and Training Details D.2.1 Fine-Tuning Setup The fine-tuning process was conducted on the Llama 3.1 8B Instruct version, with the following hyperparameters: a learning rate of 1e-5, a batch size of 1, and a temperature of 0.5. The training was performed using bfloat16 precision across 5 epochs, with a cosine learning rate scheduler and a warmup ratio of 0.05. The validation data split was employed exclusively for selecting the learning rate, while all other parameters were kept constant. 190 Figure D.1: Histogram displaying the length (in tokens) of the input prompt and output response of the fine-tuning training dataset. We utilized the "DataCollatorForCompletionOnlyLM" function from the trl package ∗ , enabling the model to be fine-tuned on completions only. This means that the input prompt and dialogue do not influence the model weights; only the gradients from the response labels are used to update the model. Full fine-tuning was applied, updating all model weights. To manage token lengths, we limited input prompts to the 95th percentile of the training input prompt lengths, approximately 4971 tokens. The response token length was capped at the 99th percentile of training response lengths, around 218 tokens. Figure D.1 illustrates the distribution of input and output token lengths. Notably, there is a spike in the response length histogram around 50 tokens, representing the fixed-length response used when the input dialogue is error-free. The standardized response is: "The assistant’s final response was appropriate. From the predefined list of error types, the assistant did not commit any errors in the final turn. Therefore, this is a correct turn." ∗See the "DataCollatorForCompletionOnlyLM" from the trl package https://huggingface.co/docs/trl/en/ sft_trainer 191 The fine-tuning process adhered to the prompt template style outlined in the official Llama documentation, using headers such as ASSISTANT_HEADER = "<|start_header_id|> assistant <|end_- header_id|>" and USER_HEADER = "<|start_header_id|> user <|end_header_id|>". This format is supported within the Hugging Face library through the tokenizer.apply_chat_template function. D.2.2 Fine-Tuning Prompt Template The following is the complete fine-tuning prompt template used during training: Task: Analyze a conversation history between a human user ("USER") and an AI assistant (" ASSISTANT") for potential errors in the assistant’s final response. Context: You will be provided with: a) A list of API functions available to the assistant in <api-pool> b) A list of common error types the assistant may make in <error-description> c) The conversation transcript between USER and ASSISTANT up to a certain turn. Instructions: Carefully review the provided API functions, error types, and conversation history. Analyze the assistant’s response in the last turn of the conversation for any of the given error types. If you detect an error, provide your reasoning for the identified error. If no errors are detected, simply state that the assistant’s response was appropriate. 192 Please provide your analysis based on the information given in the conversation transcript and the provided API and error type descriptions only. <error_types> {} </error_types> <api_pool> {} </api_pool> Listing D.5: ToolCritic Fine-Tuning Prompt Template Where the <error_types> field contains the list of all error categories with descriptions, and the <api_pool> field includes the list of all tool functions available in the dataset. 
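As a concrete illustration of the training setup described in D.2.1, the sketch below shows how the completion-only fine-tuning might be wired up with the trl library. The dataset construction (rendering the template above plus each dialogue and target response into a "text" column via tokenizer.apply_chat_template) is omitted, the path passed to load_from_disk is hypothetical, and argument names vary somewhat across trl versions; treat this as a sketch of the configuration rather than the exact training script.

```python
import torch
from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Hypothetical path; the dataset is assumed to have a "text" column holding the
# prompt template above plus the dialogue and target response, already rendered
# with tokenizer.apply_chat_template.
train_dataset = load_from_disk("data/toolcritic_sft_train")

# Mask every token before the assistant header so that only the response labels
# contribute gradients (completion-only fine-tuning).
collator = DataCollatorForCompletionOnlyLM(
    response_template="<|start_header_id|>assistant<|end_header_id|>",
    tokenizer=tokenizer,
)

args = TrainingArguments(
    output_dir="toolcritic-llama31-8b-sft",
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=4971 + 218,  # 95th-pct prompt length + 99th-pct response length
    data_collator=collator,
)
trainer.train()
```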
ToolCritic requires knowledge of all available tools to accurately detect error types, such as tool-prediction errors and non-invocation errors. D.3 Performance difference on Error Categories in Low-Label Regime In this section, we analyze the performance of our diagnostic model across different error categories when fine-tuned on a smaller portion of the training set—specifically, 10% of the data, which equates to 22 examples per error category. Figure 6 presents the confusion matrix for the test data split using this fine-tuned model. 193 Figure D.2: Confusion matrix on test dataset when training with 10% training data only From the confusion matrix, it is evident that the model achieves relatively high classification accuracy on the “non-invocation confirmation”, “non-invocation hallucination”, and “observationreasoning” error categories. In contrast, for the “tool-prediction” (api-prediction) and “non-invocation hesitation” categories, the model correctly classifies only about half of the examples. The remaining categories, however, prove to be the most challenging, with the model struggling to detect examples from these error types. This disparity in performance underscores the difficulty of accurately identifying certain error types in a low-label regime. The results suggest that while the model can generalize well on some categories even with limited data, other categories may require more extensive training data or 194 additional techniques to improve detection accuracy. This could also be an indication that the model is relying on some heuristics to easily detect some error categories. Further data analysis is needed to investigate the reason behind this performance gap, and whether such heuristics can be expected in practice or if this is an artifact of synthetic data generation. D.4 Illustrative Examples of ToolCritic’s Generalization Ability In this section, we provide illustrative examples to demonstrate the potential generalization capability of ToolCritic to new datasets and tools. Specifically, we focus on two examples from the ToolTalk dataset. In both examples, we assume the assistant (Claude 3 Sonnet) has access to a tool called AddAlarm with the following specification: { "name": "AddAlarm", "description": "Add an alarm at a particular time", "inputSchema": { "json": { "type": "object", "required": ["time"], "properties": { "time": { "type": "string", "description": "Alarm Time" }, "name": { "type": "string", "description": "A name for the alarm" 195 } } } } } Listing D.6: Tool Specification from the ToolTalk Dataset D.4.1 Scenario 1: Non-Invocation Confirmation Error In the first scenario, we assume the user asks the assistant to set an alarm at a specific time. To simulate an LLM mistake, we manually write the assistant’s response to confirm the alarm has been set, despite no tool being called. The dialogue is as follows: # Turn 1 USER: "Hey I have class tonight at 7. Can you set an alarm for 6:30?" ASSISTANT: "Can you please confirm, you want to set an alarm at 6:30 pm?" # Turn 2 USER: "Yes" ASSISTANT: "Your alarm is now set. Anything else I can help you with?" Listing D.7: Scenario 1 example demonstrating a non-invocation confirmation error We then feed this dialogue to ToolCritic, which produces the following output: 196 The user confirmed to the assistant to set an alarm for 6:30 pm. At this point, the assistant should have invoked the AddAlarm API in order to set the alarm. 
Instead, the assistant did not invoke the API but still confirmed to the user that the alarm was set. This is false and hallucinated because the assistant did not call the API. Therefore, this is a non-invocation-confirmation error. Listing D.8: ToolCritic Response with Scenario 1 dialogue as input ToolCritic correctly identifies this as a "non-invocation-confirmation" error. Interestingly, ToolCritic reasons that the AddEvent tool should have been called instead of AddAlarm. This is expected since ToolCritic was trained on the tools available in the SGD dataset and has no knowledge of the AddAlarm tool. Given this feedback and the conversation history, we prompt Claude 3 Sonnet to revise the incorrect response, resulting in the following correct response: ASSISTANT: - API CALL: AddAlarm(name=’Class alarm’, time=’18:30’) - RESULT: {’name’: ’Class alarm’, ’time’: ’18:30’} - RESPONSE: "I’ve set an alarm named ’Class alarm’ for 6:30 PM (18:30) to remind you before your class tonight. Please let me know if you need anything else." Listing D.9: Claude 3 revised response to Scenario 1 given ToolCritic feedback D.4.2 Scenario 2: Premature Invocation Error In the second scenario, we present a real mistake made by Claude 3 Sonnet according to the following dialogue: # Turn 1 USER: "Hey I have an important class tonight evening between 8pm and 10 pm." 197 ASSISTANT: - API CALL: AddAlarm(name=’Important Class’, time=’20:00’) - RESULT: {’name’: ’Important Class’, ’time’: ’20:00’} - RESPONSE: "I’ve set an alarm named ’Important Class’ for 8:00 PM tonight, March 1st, 2019 to remind you of your important class from 8-10 PM. Please let me know if you need anything else!" Listing D.10: Scenario 2 example demonstrating a premature invocation error In this scenario, the user informed the assistant that they have an important class from 8 PM to 10 PM. An ideal response would have been to ask the user if they wanted to set an alarm and, if so, at what time. Instead, Claude 3 Sonnet sets an alarm at 8 PM without user confirmation. While this response is not entirely incorrect, it could be improved. When we feed this dialogue history to ToolCritic, we receive the following feedback: The user mentioned they have an important class tonight between 8pm and 10pm. The assistant invoked the AddAlarm API. At this point in the conversation, the user has provided the time information, but has not yet confirmed that they want to set an alarm. Before adding an alarm using the AddAlarm API, the assistant should have obtained confirmation. Therefore, this is a premature invocation error. Listing D.11: ToolCritic response with Scenario 2 dialogue as input ToolCritic correctly detects this as a "premature invocation" error, highlighting the assistant’s mistake of setting the alarm without user confirmation. Following this feedback, we prompt Claude again, leading to the following revised response: ASSISTANT: "Okay, I understand you have an important class tonight from 8 PM to 10 PM. 198 Would you like me to set an alarm to remind you about this class?" Listing D.12: Claude 3 revised response to Scenario 2 given ToolCritic feedback This revised response shows that, with the help of ToolCritic’s feedback, Claude 3 Sonnet can produce a more accurate and user-aligned response, ensuring that the assistant’s actions align with the user’s intent. 199
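Both scenarios above follow the same critique-then-revise pattern: the assistant drafts a turn, ToolCritic inspects the dialogue ending in that draft, and the assistant is re-prompted with ToolCritic's reasoning whenever an error is flagged. The sketch below illustrates one way this loop might be wired. The assistant_generate and toolcritic_diagnose callables are assumed wrappers around the task LLM (Claude 3 Sonnet in these examples) and the fine-tuned ToolCritic model; the error-free marker simply matches the standardized no-error response described in D.2.1.

```python
# Minimal critique-then-revise loop around ToolCritic (illustrative sketch).
# assistant_generate(dialogue) -> str and toolcritic_diagnose(dialogue) -> str
# are assumed wrappers around the task LLM and ToolCritic, respectively; a
# dialogue is represented as a list of (role, text) tuples.

NO_ERROR_MARKER = "did not commit any errors"  # substring of ToolCritic's error-free response

def respond_with_critique(dialogue_history, assistant_generate, toolcritic_diagnose,
                          max_revisions=1):
    draft = assistant_generate(dialogue_history)
    for _ in range(max_revisions):
        critique = toolcritic_diagnose(dialogue_history + [("assistant", draft)])
        if NO_ERROR_MARKER in critique:
            break  # ToolCritic found no error; keep the draft as the final turn
        # Re-prompt the assistant with ToolCritic's reasoning appended as feedback.
        feedback_turn = ("user", f"Your previous response had a problem: {critique} "
                                 "Please revise your response.")
        draft = assistant_generate(dialogue_history + [("assistant", draft), feedback_turn])
    return draft
```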
Abstract
In the rapidly evolving field of deep learning, the increasing complexity and scale of neural network models pose significant challenges in terms of computational demand and data requirements. As models grow in size, the computational costs associated with training them escalate, making compute efficiency a critical area of research. At the same time, these models are “hungry” for labeled data, a resource that is often costly and time-consuming to procure. This is especially true in domain-specific settings, where data scarcity is pronounced and compute is not a concern. These two regimes, compute-bound training and data-bound training, necessitate innovative approaches to accomplish two separate goals: reducing the computational cost of training and training better models with minimal labeled data.
The primary goal of this dissertation is to explore and develop novel techniques that enhance training efficiency under these two regimes. On the compute side, dot-product computations (also referred to as multiply-accumulate operations) dominate the cost of training neural networks. Therefore, a fundamental question is investigated first: what types of computation are needed to train neural network models, and is there a more efficient alternative to dot products? Based on the results of this investigation, an implementation of neural network training, specifically for Convolutional Neural Networks (CNNs), using a fixed-point Logarithmic Number System (LNS) follows. The second part of this dissertation tackles the challenge of data scarcity in domain-specific natural language processing (NLP) tasks. We first explore curriculum learning strategies for training on limited labeled data using a novel difficulty criterion: labeling time. Next, building on the enhanced reasoning abilities of large language models, this dissertation also investigates synthetic data generation as a way to expand the availability of labeled data. This approach is applied to downstream NLP tasks such as information extraction and to improving the accuracy of tool use in large language models.