AGGREGATING SYMBOLS FOR LANGUAGE MODELS

by Avijit Thawani

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2024

Copyright 2024 Avijit Thawani

Dedication

I dedicate my dissertation to my family, for their infinite support, faith, and love, and to my partner Drishti, for letting me neither drift too low nor too high.

Acknowledgments

I want to express my deepest gratitude to my advisor, Jay Pujara, for making a Doctor out of nothing more than an enthusiastic kid who knocked on his doors five years ago. I am grateful to him for his constant support, for the constructive feedback, for his prophetic guidance on my research outcomes, for supporting me nonetheless as I failed and learnt, for insulating me from the ups and downs of peer review, and for pushing me to achieve just a little more every time. I feel extremely privileged to have been a part of the amazing ISI (Information Sciences Institute at the University of Southern California) family.

I am grateful to my committee members, past and present: Pedro Szekely for co-advising me for the first few years and believing in all of my enthusiastic ideas; Dani Yogatama for the helpful triweekly discussions when at DeepMind; Swabha Swayamdipta for guiding my Teaching Assistantship journey teaching Language Modeling to undergraduates at USC; and Gerard Hoberg and Aiichiro Nakano for bringing interdisciplinary perspectives to my thesis! Last but not least, I would like to thank the scientist who inspired me most at USC, the late Irving Biederman, for teaching me cognitive neuroscience, for not canceling the class even though I was the only (fortunate) registered student all semester, and for setting a mighty example of scientific curiosity by attending lab meetings and learning and teaching new topics right up to his last moments of life.

Beyond my committee, I was fortunate to find many amazing professors from all disciplines on campus, each teaching me a unique view of the world: Jonathan May for his intellectually stimulating and excitedly quick-paced lectures; Filip Ilievski for his uniquely encouraging advising style; Greg Ver Steeg and Aram Galstyan for the most information-dense class on representation learning; my internship hosts Ashwin Kalyan (Allen Institute for Artificial Intelligence), Rohan Mukherjee, Hann Wang, and Arijit Biswas (Amazon Alexa Conversations), and Stephanie Hyland, Shruthi Bannur, and Flora Liu (Microsoft Research Cambridge); Peggy Bustamante (Annenberg) for teaching me data journalism; David Belasco (Marshall), whose class ‘Taking the Leap’ I often gatecrashed (apologies) for a weekly dose of inspiration; Jill Kickul (Marshall) for introducing me to social entrepreneurship; Albert Napoli (Marshall) for judging and improving my presentation skills; Carl Collins (Dornsife) for training me and building up my confidence; and Joshua Goldstein and Yi-Hsien Liu (Dornsife) for exposing me to Chinese culture and language.
I have learnt as much from my labmates as from my advisors, and I would therefore like to thank them for gracing my years with their companionship: Thamme Gowda for being my first mentor on campus; Sami Abu-El-Haija for tutoring and entertaining my half-baked research ideas; Lee Kezar for their friendly vibe as well as insightful discussions; Justin Cho for grounding our shared entrepreneurial zeal in frequent conversations; Ehsan Qasemi for his light-hearted presence; labmates Pei Zhou, Kexuan Sun, Pegah Jandaghi, Kian Ahrabian, Dong-Ho Lee, Yifan Jiang, and Eric Boxer for our fun socials and group meeting feedback sessions; co-interns Swaroop Mishra (ASU), Pedro Sanchez (Edinburgh), Nur Yildrim (CMU), Adam Jelley (Edinburgh), and Eloi Alonso (ETH); and my amazingly talented student collaborators Saurabh Ghanekar, Xiaoyuan Zhu, Minda Hu, Erdong Hu, Husain Zafar, Naren Teja Divvala, Dipesh Kumar (Microsoft), and Eshaan Aggarwal and Harsh Pandey (IIT BHU).

Inspired by the ancient Indian student Eklavya, who worshipped his guru Dronacharya despite never being enrolled in the latter's classes, I would like to extend my thanks to a few instructors who guided me not through direct interaction but by being a distant, inspiring example: Andrej Karpathy for, like me, writing science fiction [2] and open source software for literature review [3], as well as for supporting my arguments against subword tokenization; Marvin Minsky for his book Society of Mind (Minsky, 1988), which unravelled many mysteries of the human mind; Jeff Hawkins for his book A Thousand Brains (Hawkins, 2021), which connected neuroscience to artificial intelligence and inspired my appreciation of the transformer attention architecture [4]; and Mor Geva for upholding the highest standards in scientific integrity and intellectual curiosity.

[2] My AI short story: Word (Thawani, A. 2020). [3] Living Surveys, a literature review tool. [4] My blog summarizing and extending the book's analogies.

I am extremely happy to have shared this journey with my buddy, Kushal Chawla, whom I thank for making USC a home away from home, as well as for the many insightful discussions on the nature of consciousness! My flatmates Sujay Patil, Mandar Deshpande, Umang Gupta, and Lokesh Krishna all helped sprinkle a part of their personalities into mine while I acclimated to the country. Finally, I am forever indebted to my mom and dad, my brother Siddharth, my sister-in-law Saumya Garg, and my partner Drishti Kalhans for believing in me, especially in times when I would doubt myself, and for supporting me with their endless love and affection.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Background
    2.1 Numeracy
        2.1.1 Tasks
            2.1.1.1 Our Taxonomy of Tasks
            2.1.1.2 Survey of Existing Tasks
        2.1.2 Numeracy Methods
            2.1.2.1 String Based
            2.1.2.2 Real Based
        2.1.3 Existing Methods for Numerate Language Models
            2.1.3.1 String-based methods
            2.1.3.2 Real-based methods
        2.1.4 Survey of Results
        2.1.5 Recommendations
    2.2 Tokenization
        2.2.1 Byte Pair Encoding
            2.2.1.1 Character/Byte-level
            2.2.1.2 Beyond word level
            2.2.1.3 Visual segmentation
            2.2.1.4 Learnt subword segmentation
            2.2.1.5 What is missing
Chapter 3: Dataset for Numeric Language Modeling
Chapter 4: Effects of Tokenization on Number Estimation in Text
    4.1 Methods
    4.2 Implementation Details
    4.3 Experiments and Results
        4.3.1 Main Results
        4.3.2 Downstream zero-shot transfer
    4.4 Neuron Probing
    4.5 Discussion
Chapter 5: Effects of Tokenization-enhanced Numeracy on Literacy
    5.1 Methods
    5.2 Experiments
    5.3 Results and Discussion
    5.4 Discussion
Chapter 6: Beyond Numeracy: Tokenization of Multi Word Expressions
    6.1 Methods
        6.1.1 BPE beyond words
        6.1.2 Adding MWEs with PMI
    6.2 Datasets
    6.3 Experiments
    6.4 Results
        6.4.1 Words combine in Diverse ways
        6.4.2 Where do MWEs help NMT?
    6.5 Related Work
    6.6 Discussion
Chapter 7: Towards End-to-End Learnt Tokenization
    7.1 Method
    7.2 Experiments
        7.2.1 Models
        7.2.2 Datasets
        7.2.3 Metrics
        7.2.4 Implementation
    7.3 Results
        7.3.1 Main Results
        7.3.2 Representation Power
        7.3.3 Predicting Rare Words
        7.3.4 Number Estimation
    7.4 Efficiency Analysis
        7.4.1 Training Speed-up
        7.4.2 Generation Speed-up
    7.5 Discussion
Chapter 8: Downstream effects: Learnt Tokenization on Machine Translation
    8.1 Background
        8.1.1 Bytes
        8.1.2 Byte Pair Encoding
    8.2 Methodology
    8.3 Experiment Setup
    8.4 Results
        8.4.1 How well do learnt tokenizers encode source and decode target text?
        8.4.2 How robust are tokenizers to data scarcity?
        8.4.3 How robust are tokenizers to noise?
        8.4.4 Do different tokenizers specialize in different kinds of translations?
        8.4.5 Can we quantify the morphological preference of tokenizers?
    8.5 Related Work
    8.6 Discussion
Chapter 9: Conclusion
    9.1 Future Work
        9.1.1 Downstream effects: Financial Language Modeling
        9.1.2 Interpretability: Aggregation of Symbols inside LLMs
    9.2 Limitations
Chapter 10: Ethical Considerations
Bibliography
List of Tables

2.1  Seven numeracy tasks, arranged along the axes of (rows) granularity - exact vs approximate, and (columns) units - abstract vs grounded. We also list downstream applications requiring a similar granularity of numeracy. Chapter 3 onwards, we will focus on only grounded, approximate numeracy, in particular the task of Numerical Language Modeling.
2.2  An overview of numeracy in NLP: Each row is a method (§2.1.3), arranged as per our taxonomy (§2.1.2) split by string and real, further branching into three dimensions each. The last seven columns correspond to the seven subtasks of numeracy (§2.1.1.2), split by Exact and Approximate granularity (§2.1.1.1). The cells point to representative (not exhaustive) works that have experimented with a given method (row) on a given task (column). Notes: Prototype* is encoder-only but reuses embeddings for the decoder (Jiang et al., 2020). GMM** has been discretized (Spithourakis and Riedel, 2018) as well as continuous valued (Berg-Kirkpatrick and Spokoyny, 2020).
2.3  Literature Review of existing tokenization methods along several dimensions. Compress? Is the input string chunked into bigger units? Generate? Whether or not the model can generate new unseen tokens? Learnt? Is the tokenization learnt end-to-end with other parameters? Word Boundary? Is the word boundary considered or treated as just another token? Conv: Convolution. Attn: Attention.
2.4  Literature Review of alternative tokenizers and what they control for. We work with Factorizer, the only tokenizer that controls for all dimensions and makes it possible to compare directly against a subword vocabulary.
3.1  Example sentences from Wiki-Convert along with annotated numbers and units.
4.1  Multiple approaches to masked number prediction or number decoding. Color Coding: Tokens in the vocabulary of BERT (Devlin et al., 2019). New tokens. Continuous-valued predictions.
4.2  Main Results: Order of magnitude accuracy (E-Acc) and Log Mean Absolute Error (LMAE) over the test set of three datasets, contrasting the three degrees of freedom for improving numeracy of language models. NA denotes subword models which were unable to emit valid numbers for at least 50% of the examples. Best and second best results are bold-faced and underlined respectively.
4.3  Results on our new dataset: Wiki-Convert.
4.4  Example predictions from FinNews dev set. Ours (DExp-GM) and DExp estimate numbers in the same order of magnitude as ground truth (Ans.) but the estimate of the subword baseline (Sub) is far off.
4.5  Downstream performance of our main methods over fact estimation for solving Fermi Problems. NA denotes subword models which were unable to emit valid numbers for at least 50% of the examples.
5.1  Numerate language models perform better at masked word prediction. BERT: Default BERT baseline. Exp: BERT with exponent embeddings (§5.1).
5.2  Results on masked word prediction over two datasets and six methods, averaged over two runs with different random seeds. PPL = Perplexity. LValue = Log Value. Exp = Exponent embeddings.
5.3  Results on masked word prediction in non-numeric contexts from Wikicorpus, averaged over three runs. PPL = Perplexity. Exp = Exponent embeddings.
5.4  Qualitative error analysis over Wiki-Convert, showing examples where the Default baseline fails and the Exponent embeddings correctly predict the masked word. Asterisk* indicates: same as ground truth.
5.5  Results over a sample of Wiki-Convert test set, stratified by the kind of token masked.
6.1  Example tokenizations of MWEs (bigrams, trigrams, skip-grams) in our implementation. Raw = original sentence, Tok = tokenized form. Typical BPE tokens are colored yellow and MWEs are colored green.
6.2  Different methods of adding MWEs to a BPE vocabulary on NMT across two language pairs.
6.3  Training, validation and testing datasets along with sentence count in each set.
6.4  Left: Coverage of the top 5 most frequent English MWEs (PMI-based), extracted from the first language pair and (coverage) evaluated over the second. Coverage of a token is defined as the fraction of target (English) sentences containing the token. Right: The top five MWEs of each type (PMI except when labelled Freq).
6.5  Do MWEs help more when added to the source-side, the target-side or both? Each cell reports Dev/Test BLEU scores over the Hi-En dataset only. Baseline scores without MWEs are 15.6 / 14.4 respectively.
7.1  Trade-offs involved when choosing tokenizers: Subword vs Bytes/Characters vs eByte/eChar (ours).
7.2  Statistics for our language modeling datasets. See Section 7.2.2 for more details.
7.3  Word Prediction Accuracies (Acc %) for different languages and tokenizers. See Section 7.3.1 for details.
7.4  Word Prediction Accuracies for different representative power (number of prefix CLS tokens) per word in our end-to-end byte/char-tokenized (Tok) models. Up to 45% higher prediction scores are available for a marginal increase in memory (Mem in GBs) of about 20 MBs. See Section 7.3.2 for details.
7.5  Case study: Word Prediction Accuracies for Russian across tokenizers, stratified by Rare and Frequent words. See Section 7.3.3 for details.
7.6  Number Estimation results on Numeracy dataset across tokenizers. % Num = the percentage of times the model predicts a number, over which the next two metrics are calculated. EAcc = Exponent Accuracy. MdAPE = Median Absolute Percentage Error. See Section 7.3.4 for details.
8.1  Literature Review of alternative tokenizers and what they control for. We work with Factorizer, the only tokenizer that controls for all dimensions and makes it possible to compare directly against a subword vocabulary.
8.2  Summary of our Training, Development, and Test Datasets on ten language pairs.
8.3  Comparison of different source tokenizers with the target fixed (xx → BPE-8K) across 12 language pairs, along with standard deviations over 3 runs with different random seeds. English source experiments are averaged over three different test sets, resulting in higher variance. We also report (micro) averages grouped by source language. Takeaway: Factorizer does not outperform BPE but is better than Bytes when translating Arabic.
8.4  BLEU scores on Arabic → English stratified by lengths. Factorizer particularly outperforms when the reference is either very short or very long.
8.5  Representative samples of Arabic → English translations - three examples each of where Factorizer significantly outperforms BPE and vice versa (as measured by Sentence BLEU). We highlight the winning system's successes and failures.
9.1  Patterns found in tokens that share similar codes trained with eByte over Wiki-Convert.

List of Figures

3.1  Distribution of numbers in Wikipedia.
4.1  Histogram of mantissas for the 58K sentences in FinNews dev set (true) and corresponding predictions by DExp (pred). See Section 4.3.1 for details.
4.2  Precision Recall curve.
5.1  Different number encoders as described in Section 5.1. Notes: † 2.517 is log10 329. ‡ 329 collapses to the 130th bin out of 200 log-scaled bins within our range of [1e-4, 1e+6].
6.1  Qualitative error analysis over Hi-En test set, showing examples comparing the Baseline and the Skip-Gram augmented model, where the skip-gram (This · is) occurs in the latter's predictions.
6.2  Top scoring multi-word expressions extracted from the training corpora.
7.1  Overview of our proposed simple end-to-end tokenized autoregressive language model. A transformer encoder compresses the variable number of base units (here, characters) into n=1 CLS tokens per word. Dotted characters are the previously predicted tokens at inference, and when training they are the ground truth.
7.2  Self-attention visualized across (1) Byte/Char-level models, (2) Subword/Word-level models, and (3) Our proposed end-to-end tokenization modules (word encoder; base LM decoder; word decoder) with character base. Blue blocks indicate self attention mask. @ symbol indicates a prepended CLS token per word.
8.1  Left: Non-concatenative morphology in Arabic often interleaves letters within the root (Clark et al., 2022). Right: Subword tokenization in GPT-4 instead only captures ‘contiguous’ sequences of characters.
8.2  Pictorial depiction of how the Factorizer (Samuel and Øvrelid, 2023) learns token embeddings as an autoencoder (seen here reconstructing the word ‘do’) where the final summed embeddings of the word are used to evaluate on syntactic tasks. We specifically borrow these intermediate codes labelled Factorizer 258 and Factorizer 794 in our chapter as stand-in replacements for a BPE tokenizer, enabling fair comparison on NMT.
8.3  BLEU scores on target side with the source side fixed as (xx ← BPE-8K) across six language pairs. BPE consistently outperforms Factorizer.
8.4  Data Scarcity: BLEU scores over Ar → En with different source-side tokenizers (target-side fixed at BPE 8k). Most tokenizers lose performance in a low resource setting but Factorizer 794 gains the most.
8.5  Ar→En relative BLEU scores (100 denotes noiseless) with varying degrees of noise added to the test source sentences. Factorizer performance relatively degrades less than BPE as noise increases.
8.6  Representativeness in English. BPE 794 codes well represent more root forms than Factorizer 794 (rightwards is better). See Section 8.4.5 for details.
8.7  Representativeness in Arabic. Factorizer 794 codes well represent more root forms than BPE 794 (rightwards is better). See Section 8.4.5 for details.

Abstract

Natural language is a sequence of symbols. Language models (LMs) are powerful at learning sequence patterns. The first step for large language models (LLMs) like ChatGPT is to convert text (that humans understand) into symbolic codes (that models do). This crucial phase in the Language Modeling pipeline has unfortunately been understudied and is currently achieved by subword segmentation, a manually engineered set of heuristics. I scrutinize case studies where these heuristics fail and recommend improvements: for example, when representing numbers, multi-word expressions, and non-concatenative languages. I present an end-to-end tokenized language model that understands both words and numbers better than subwords, without any manually engineered heuristics. This model also outperforms character-level tokenization, promising up to a 4x speed-up in inference and a 6x speed-up in training. I show the benefits of aggregating symbols for language modeling, and investigate key aspects of symbol use in LMs:

1. Aggregating on the number line improves both numeracy and literacy of language models.
2. We can learn to aggregate symbols given a corpus, with improved language modeling and approximate numeracy.
3. Learning to aggregate symbols helps downstream performance in certain application areas like neural machine translation of non-concatenative languages.

Chapter 1
Introduction

The remarkable progress of large language models (LLMs) in recent years has captivated the world with their ability to generate human-like text, engage in conversation, and tackle complex tasks across diverse domains. From crafting creative fiction to summarizing intricate scientific articles, LLMs like ChatGPT (OpenAI, 2023) are rapidly reshaping human-computer interaction and pushing the boundaries of artificial intelligence. However, behind these impressive feats lies a crucial yet often overlooked component: tokenization. This process, which transforms human-readable text into a sequence of symbolic codes interpretable by machines, forms the very foundation of how LLMs understand and interact with language.

Traditionally, tokenization has relied on subword segmentation, a set of manually engineered heuristics that break down words into smaller units based on their frequency of occurrence. Examples of popular subword tokenization algorithms include Byte Pair Encoding (BPE) (Sennrich et al., 2016a), WordPiece (Wu et al., 2016b), and Unigram (Kudo, 2018a). While efficient and effective for many NLP tasks, particularly in English, this approach suffers from inherent limitations when applied to diverse languages and complex linguistic phenomena.
Subword tokenization struggles to capture the nuances of:

1. Non-concatenative morphologies: Languages like Arabic, Hebrew, and Finnish utilize non-concatenative morphology, where morphemes are interwoven within a word rather than simply appended as prefixes or suffixes. Subword tokenization, primarily designed to identify contiguous character sequences, often fails to capture these intricate interleaving patterns, as we will explore in detail in Chapter 7.

2. Numeracy: Numbers, despite their ubiquity in natural language, are often inconsistently segmented into subwords, leading to poor representations of their scalar magnitude and relationships on the number line. This deficiency limits the ability of LLMs to grasp numerical concepts, perform arithmetic operations, and reason about quantitative information, as we will demonstrate in Chapters 3, 4, and 5.

3. Technical vocabulary: Subword tokenization often struggles to handle technical domains such as biomedical documents, legal texts, and financial articles, as well as different languages. These domains utilize specialized vocabularies with unique terminologies and jargon, requiring custom tokenization strategies to effectively capture their semantic richness (Chapters 8 and 9).

Recognizing these limitations, the research community has begun exploring alternative tokenization strategies that move beyond the rigid confines of subword segmentation and aim to learn more expressive and inclusive representations of language. This dissertation extends this exciting frontier of NLP research, focusing on the potential of aggregating symbols for language modeling. We argue that by embracing tokenization strategies that learn to aggregate symbols based on their inherent meaning and relationships, we can empower LLMs to:

1. Develop robust numeracy skills: By explicitly representing the scalar magnitude of numbers and their relationship to other numbers on the number line, we can improve their ability to estimate quantities, perform arithmetic operations, and reason about numerical facts. Chapters 4 and 5 showcase how simple yet effective changes in vocabulary can significantly enhance the numeracy of LLMs, leading to improved performance on tasks such as masked number prediction and numerical fact estimation.

2. Enhance language understanding: We demonstrate that improved numeracy can positively impact literacy, leading to better word prediction and a more nuanced understanding of semantic relationships even in contexts without explicit numerical cues. Chapter 5 explores this interplay between numeracy and literacy, showing how a numerate language model can achieve better word prediction accuracy and capture subtle semantic distinctions. Chapter 6 expands the scope of challenges in subword segmentation to multi-word expressions, further motivating the need for alternative tokenization strategies.
3. Embrace linguistic diversity: By leveraging vector quantization and word boundary information, we can move beyond the limitations of subword segmentation and develop tokenizers that are better suited to capture the rich diversity of morphologies across languages. In Chapter 7, we introduce an end-to-end learned tokenization strategy that utilizes word boundary information to compress character-level representations into efficient and expressive word embeddings. We demonstrate its effectiveness for language modeling across multiple languages, achieving significant gains in next-word prediction accuracy compared to both subword and character-level models.

4. Increase robustness and efficiency: Learned tokenizers, by capturing long-range dependencies and utilizing contextual information, can exhibit greater robustness to noise and data scarcity. Chapter 8 explores the downstream effects of learned tokenizers on machine translation, showcasing their resilience to misspellings and improved performance on very short and very long sentences. Additionally, we analyze their preference for representing non-concatenative morphologies, shedding light on their potential for inclusive NLP.

This dissertation contributes to the ongoing conversation surrounding tokenization in NLP, advocating for a shift from manually engineered heuristics to a more data-driven and context-aware approach. Through a series of controlled experiments, we offer empirical evidence for the benefits of aggregating symbols for language modeling and provide insights into the intricate interplay between tokenization, numeracy, and language understanding. We hope this work inspires the development of more flexible and inclusive tokenization techniques, ultimately empowering LLMs to better understand and interact with the rich diversity of human language.

Chapter 2
Background

This chapter presents the foundational concepts of numeracy and tokenization in natural language processing (NLP), setting the background for a deeper exploration in the following chapters. We begin by scrutinizing the multifaceted nature of numeracy, proposing a novel taxonomy that categorizes numeracy tasks based on granularity (exact vs. approximate) and units (abstract vs. grounded). We then survey existing NLP methods for representing and manipulating numerical information, highlighting their strengths and limitations across various tasks. Subsequently, we shift our focus to tokenization, the fundamental process of segmenting text into discrete units. We review prevalent tokenization strategies, including byte pair encoding and character-level approaches, and discuss their implications for language model performance. Finally, we identify key research gaps and motivate the need for more dynamic and context-aware tokenization methods, particularly in the context of numerical language modeling.

2.1 Numeracy

Recent NLP progress towards numeracy has been sporadic but encouraging. In this section, we survey prior work and highlight the kind of numeracy targeted (e.g., arithmetic, measurement, numeration) as well as the kind of representation used (e.g., value embeddings, DigitRNNs). We provide the first NLP-centric taxonomy of numeracy tasks (Section 2.1.1) and of number representations (Section 2.1.2) for the reader to succinctly comprehend the challenge posed by numeracy. We synthesize key takeaways (Section 2.1.5) and propose a unifying vision for future research.

2.1.1 Tasks

There are several different aspects of numeracy. The DROP dataset alone offers a wide variety of numeric reasoning questions such as retrieval-based (How many yards did Brady run?), count-based (How many goals were scored? given a comprehension describing multiple goals), and simple arithmetic (How many years after event 1 did event 2 occur? given dates of both events).
Besides downstream applications, there have also been probing experiments to evaluate whether NLP models can decode numbers from strings (e.g., 19 to 19.0), or estimate quantities (e.g., how tall are lions?). Such a diverse range of abilities is usually referred to collectively as numeracy, which gives rise to confusion. We limit this abuse of terminology and provide a neat taxonomy for arranging the different tasks proposed under numeracy.

Exact | Abstract: Simple Arithmetic (2+3=5) | Grounded: AWP (2 balls + 3 balls = 5 balls), Exact Facts (birds have two legs) | Downstream Applications: Question Answering, Science Problems
Approx | Abstract: Numeration (‘2’ = 2.0), Magnitude (‘2’ < ‘5’) | Grounded: Measurement (dogs weigh 50 lbs), Numeral Categorization | Downstream Applications: Sarcasm Detection, Numerical Language Modeling

Table 2.1: Seven numeracy tasks, arranged along the axes of (rows) granularity - exact vs approximate, and (columns) units - abstract vs grounded. We also list downstream applications requiring a similar granularity of numeracy. Chapter 3 onwards, we will focus on only grounded, approximate numeracy, in particular the task of Numerical Language Modeling.

2.1.1.1 Our Taxonomy of Tasks

Drawing from work in cognitive science (Feigenson et al., 2004), we propose the following two dimensions to organize tasks within numeracy:

1. Granularity: whether the encoding of the number is (1) exact, e.g., birds have two legs, or (2) approximate, e.g., Jon is about 180 cm tall. Parallel work in cognitive science has explored whether exact and approximate numeracy require two specialized modules (Feigenson et al., 2004) or could be handled with a single representation (Cordes et al., 2001).

2. Units: whether the numbers are (1) abstract, e.g., 2+3=5, or (2) grounded, e.g., 2 apples + 3 apples = 5 apples. While abstract mathematical tasks are easy to probe and create artificial datasets for, numbers grounded in units are challenging since they need to be understood in the context of words.

2.1.1.2 Survey of Existing Tasks

We now describe seven numeracy tasks, arranged according to our taxonomy in Table 2.1, as well as downstream tasks (right-most column in the table).

Simple Arithmetic is the task of addition, subtraction, etc. over numbers alone. It is convenient to create synthetic datasets involving such math operations for both masked (Geva et al., 2020) and causal language models (GPT-3, Brown et al. 2020).

Numeration or Decoding refers to the task of mapping a string form to its numeric value, e.g., 19 to 19.0. Within NLP, this task is set up as a linear regressor probe over a (frozen) representation of the string. Numeration has been probed for in static word embeddings (Naik et al., 2019), contextualized language models (Wallace et al., 2019), and multilingual number words, e.g., nineteen or dix-neuf (Johnson et al., 2020).

Magnitude Comparison is the ability to tell which of two (or more) numbers is larger. For language models, this has been probed in an argmax setup (choose the largest of five numbers) as well as a binary classification task, e.g., given 23 and 32, pick the label 1 to indicate that 32 > 23 (Naik et al., 2019; Wallace et al., 2019).

Arithmetic Word Problems (AWP) are the grounded version of simple arithmetic that we find in school textbooks, e.g., Mary had two cookies. She gave one away. How many does she have left? There exist several NLP datasets on math word problems (Amini et al., 2019; Saxton et al., 2019; Roy and Roth, 2015; Hendrycks et al., 2021).
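As a concrete illustration of the probing setup used for numeration and magnitude comparison above, the following is a minimal sketch that fits a linear regressor over frozen embeddings; the random lookup table is only an illustrative stand-in for an actual frozen encoder such as GloVe or BERT.

```python
# Minimal sketch of a numeration probe: a linear regressor trained to recover
# a numeral's value from its frozen embedding. The random lookup table below
# is only a stand-in for a real frozen encoder (GloVe, BERT, etc.).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
numerals = [str(n) for n in range(1, 1001)]             # "1" ... "1000"
frozen = {w: rng.normal(size=64) for w in numerals}     # frozen embeddings, never updated

X = np.stack([frozen[w] for w in numerals])
y = np.array([float(w) for w in numerals])              # target: the numeral's value

probe = LinearRegression().fit(X[:800], y[:800])        # only the probe is trained
print("held-out R^2:", probe.score(X[800:], y[800:]))   # numeration score

# Magnitude comparison can reuse the same probe: decode both values and label
# which is larger (with a real encoder, "32" should decode higher than "23").
print(probe.predict([frozen["32"]])[0] > probe.predict([frozen["23"]])[0])
```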
Exact Facts in the context of numeracy involve commonsense knowledge such as dice have 6 faces or birds have two legs. An approximate sense of quantity would be of little help here since assertions like dice have 5 faces or birds have three legs are factually incorrect. Two recent datasets for numeric commonsense facts are Numbergame (Mishra et al., 2020) and NumerSense (Lin et al., 2020).

Measurement Estimation is a task in psychology in which subjects are asked to approximately guess measures of objects along certain dimensions, e.g., the number of seeds in a watermelon or the weight of a telephone (Bullard et al., 2004). VerbPhysics (Forbes and Choi, 2017) is a benchmark of binary comparisons between physical attributes of various objects, e.g., ball <_size tiger. DoQ (Elazar et al., 2019) is a web-extracted dataset of Distributions over Quantities, which can be used as a benchmark for language models' measurement estimation abilities (Zhang et al., 2020). Lastly, MC-TACO (Zhou et al., 2020) is a collection of temporal-specific measurement estimates, e.g., going for a vacation spans a few days/weeks.

Numerical Language Modeling in its literal sense is not a task but a setup, analogous to masked/causal language modeling for words. Other tasks could be modeled as numeric language modeling, e.g., arithmetic (2+3=[MASK]) and measurement estimation (lions weigh [MASK] pounds). In practice, numerical language modeling refers to the task of making numeric predictions for completing unlabelled, naturally occurring text. This will be the focus of our experiments in the following sections. Word predictions in language modeling are typically evaluated with classification metrics such as accuracy or perplexity. Numeric predictions, on the other hand, are evaluated with regression metrics such as mean absolute error, root mean squared error, or their log and percentage variants (Berg-Kirkpatrick and Spokoyny, 2020). Spithourakis and Riedel (2018) also propose an Adjusted Perplexity metric to cancel the effect of the out-of-vocabulary rate on the perplexity of numeric tokens.

Downstream Applications for numeracy abound. Dubey et al. (2019) detect sarcasm in tweets based on numbers. Chen et al. (2020) identify claims in financial documents using alternative number representations and the auxiliary task of numeral understanding or categorization (Chen et al., 2018). Similarly, simple arithmetic and math word problems serve as auxiliary tasks for GenBERT (Geva et al., 2020) towards improving its score on the DROP QA benchmark.

2.1.2 Numeracy Methods

Analogous to our taxonomy of subtasks in the previous section, here we attempt to arrange the wide variety of alternative number representations proposed in recent literature. We limit our analysis to methods of encoding (numbers → embeddings) and/or decoding (embeddings → numbers). We do not discuss, for example, methods that use symbolic reasoning (Andor et al., 2019) or modify activation functions to enhance numeracy (Trask et al., 2018). A typical example of the base architecture could be BERT (Devlin et al., 2019), one of the most widely used transformer-based language models. We assume that there exists an independent parallel process of mapping words into embeddings, such as subword tokenization followed by lookup embeddings in BERT. We look at two kinds of representations: string-based and real-based. Real-based representations perform some computation involving the numerical value of the number. The string-based representations instead see numbers in their surface forms; they must assign arbitrary token IDs and look up their embeddings to feed into the architecture.
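To make the distinction concrete, the sketch below contrasts the two families with toy, illustrative implementations; the tokenizer, the 64-dimensional embedding space, and the particular encoders are placeholders rather than methods from the literature.

```python
# Illustrative stand-ins for the two families of number representations.
# String-based: the numeral goes through the usual token pipeline (IDs + lookup).
# Real-based: the numeral's value is encoded to / decoded from a vector directly.
from typing import List
import numpy as np

VOCAB_SIZE, D = 30522, 64
lookup = np.random.default_rng(0).normal(size=(VOCAB_SIZE, D))

def string_based(numeral: str) -> np.ndarray:
    token_ids: List[int] = [hash(ch) % VOCAB_SIZE for ch in numeral]  # toy tokenizer
    return lookup[token_ids]                  # one arbitrary embedding per token

def real_encode(value: float) -> np.ndarray:  # encoder: value -> embedding
    return np.log10(value) * np.ones(D)       # (here, a log-scaled value encoding)

def real_decode(vec: np.ndarray) -> float:    # decoder: embedding -> value
    return float(10 ** vec.mean())

print(string_based("329").shape)              # (3, 64): purely symbolic tokens
print(real_decode(real_encode(329.0)))        # ~329.0: the magnitude is preserved
```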
2.1.2.1 String Based

By default, language models treat numbers as strings, the same as words. However, within string representations, one could make a few simple changes:

Notation: The number 80 could be written in Hindu-Arabic numerals (80), Roman numerals (LXXX), scientific notation (8e1), English words (eighty), or with base 20 as in French (quatre-vingts). Nogueira et al. (2021) exclusively study the effect of many such notation choices in language models, on the task of simple arithmetic.

Tokenization: Word-level tokenizations are ineffective for numbers, since they are likely to map most numbers to an UNK token, except for a few commonly occurring ones (e.g., 1, 2, 5, 10, 100). Other possibilities are subword tokenizations like BPE and WordPiece, as well as character (or digit) level tokenizations.

Pooling: The pooling dimension of variation springs up after analyzing the effect of tokenization. With subword and character-level tokenizations, a single number may now correspond to multiple tokens, e.g., 100 segmented into 10-0 or 1-0-0. Prior work (Spithourakis and Riedel, 2018) has argued for using RNNs or CNNs to instead pool the embeddings of these tokens into a single embedding before feeding it to the language model. By default, language models see numbers the same way as words, hence no pooling is applied.

2.1.2.2 Real Based

Real-based number encoders can be expressed as f : R → R^d, whereas decoders can be expressed as g : R^d → R. Real-based methods proposed in the literature can vary on account of direction (whether they encode, decode, or both), scale (linear vs log), and discretization (binning vs continuous valued).

Direction: Some proposed methods are encoder-only, e.g., DICE (Sundararaman et al., 2020), while some can be decoder-only, e.g., those requiring sampling from a parameterized distribution (Berg-Kirkpatrick and Spokoyny, 2020). In later chapters, we consider methods of encoding (Chapter 5) and decoding (Chapter 4) numbers on different tasks.

Scale: Inspired by cognitive science literature (Dehaene, 2011), several methods have attempted to model numbers on the log (instead of linear) scale, i.e., to perform mathematical operations on the logarithm of the number to be represented. The first operation in a log-scaled f is log(·) and the last operation in a log-scaled g is exp(·). We discuss more scales in the following subsection, such as the stabilized log scale (Jiang et al., 2020) and the learned scale/flow (Berg-Kirkpatrick and Spokoyny, 2020).

Discretization: Training continuous value functions for a large range of numbers turns out to be practically infeasible (Wallace et al., 2019). Some real-based methods first bin numbers before learning embeddings for each bin. These bins could be on the linear scale (0-10, 10-20, 20-30, ...) or the log scale (0.01-0.1, 0.1-1, 1-10, ...), and the lookup embeddings can be learnt by the regular cross entropy (Chen et al., 2020) or dense cross entropy (Zhang et al., 2020).
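The following is a minimal sketch of the log-scaled binning just described; the range and bin count are illustrative choices, which the cited works tune differently.

```python
# Sketch of the discretization choice above: numbers are mapped to log-scaled
# bins before learning one lookup embedding per bin. Range and bin count are
# illustrative (here, 10 bins per decade over 0.01 to 1,000,000).
import numpy as np

LO, HI, N_BINS = 1e-2, 1e6, 80

def log_bin(x: float) -> int:
    z = (np.log10(x) - np.log10(LO)) / (np.log10(HI) - np.log10(LO))
    return int(np.clip(z, 0.0, 1.0 - 1e-9) * N_BINS)

bin_embeddings = np.random.default_rng(0).normal(size=(N_BINS, 64))  # one vector per bin

for x in (0.5, 7.0, 329.0, 40000.0):
    print(x, "-> bin", log_bin(x))            # nearby magnitudes share a bin
print(bin_embeddings[log_bin(329.0)].shape)   # (64,)
```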
2.1.3 Existing Methods for Numerate Language Models

Having established dimensions of variance of number representations, we describe some key string-based and real-based methods used in prior work. Table 2.2 depicts these methods as individual rows, with the first three columns showing their position in our taxonomy (§2.1.2). The last seven columns correspond to the seven tasks (§2.1.1.2), with each cell denoting a representative work that introduces it.

String-Based methods (Notation | Tokenization | Pooling), followed by their task-cell citations across Exact (Arith, Facts, AWP) and Approximate (Num, Mag, Meas, LM_N):
  Word Vectors   | Decimal    | Word    | NA  | W+19, W+19, W+19, G+19, J+20
  Contextualized | Decimal    | Subword | No  | W+19, L+20, G+20, W+19, W+19, Z+20, SR18
  GenBERT        | Decimal    | Char    | No  | G+20, G+20
  NumBERT        | Scientific | Subword | No  | Z+20, Z+20
  DigitRNN/CNN   | Decimal    | Char    | Yes | W+19, W+19, W+19, SR18
  DigitRNN-sci   | Scientific | Char    | RNN | BS20
  Exponent       | Scientific | Word    | NA  | BS20

Real-Based methods (Scale | Direction | Binning), followed by their task-cell citations:
  DICE           | Linear | Enc-only  | No     | S+20, S+20, S+20
  Value          | Linear | Both      | No     | W+19, W+19, W+19
  Log Value      | Log    | Both      | No     | W+19, W+19, W+19, Z+20
  MCC            | Log    | Dec-only  | Yes    | Z+20
  Log Laplace    | Log    | Dec-only  | No     | BS20
  Flow Laplace   | Learn  | Dec-only  | No     | BS20
  DExp           | Log    | Dec-only  | No     | BS20
  GMM            | Linear | Dec-only  | Both** | SR18
  GMM-proto      | Linear | Enc-only* | No     | J+20, J+20, J+20, J+20
  SOM-proto      | Log    | Enc-only* | No     | J+20, J+20, J+20, J+20

Table 2.2: An overview of numeracy in NLP: Each row is a method (§2.1.3), arranged as per our taxonomy (§2.1.2) split by string and real, further branching into three dimensions each. The last seven columns correspond to the seven subtasks of numeracy (§2.1.1.2), split by Exact and Approximate granularity (§2.1.1.1). The cells point to representative (not exhaustive) works that have experimented with a given method (row) on a given task (column). Notes: Prototype* is encoder-only but reuses embeddings for the decoder (Jiang et al., 2020). GMM** has been discretized (Spithourakis and Riedel, 2018) as well as continuous valued (Berg-Kirkpatrick and Spokoyny, 2020).

2.1.3.1 String-based methods

Word Vectors & Contextualized Embeddings: Word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018a), and BERT (Devlin et al., 2019) have been probed as baselines against several contending methods.

GenBERT: Geva et al. (2020) present GenBERT, a question answering model with pretrained BERT serving as both its encoder and decoder. GenBERT tokenizes numbers at the digit level, and is finetuned on auxiliary tasks of arithmetic word problems and simple arithmetic.

NumBERT: Zhang et al. (2020) pretrain BERT from scratch over a modified dataset such that all numbers have been converted into scientific notation, i.e., 314.1 is expressed as 3141[EXP]2. NumBERT hence follows a scientific notation, subword tokenization, and no pooling (in the sense of §2.1.2.1).

DigitRNN, DigitCNN: Spithourakis and Riedel (2018) and Wallace et al. (2019) experimented with pooling of digit embeddings into a single embedding representing the full number. Both used RNNs as well as CNNs for pooling.

DigitRNN-sci & Exponent (Embedding): Berg-Kirkpatrick and Spokoyny (2020) use a scientific notation variant of DigitRNNs (which we refer to as DigitRNN-sci in Table 2.2), as well as a simpler alternative: exponent embedding. The latter merely learns a lookup embedding for the exponent, completely ignoring the mantissa.

2.1.3.2 Real-based methods

DICE: Deterministic Independent-of-Corpus Embeddings (Sundararaman et al., 2020) is an attempt to handcraft a number encoder f (in the sense of §2.1.2.2) so as to preserve the relative magnitude between two numerals and their embeddings. Given two scalars i and j, and their embeddings f(i) and f(j), the cosine distance between f(i) and f(j) is intended to monotonically increase/decrease with the Euclidean distance between i and j. DICE is offered not only as a deterministic encoding but also as an auxiliary loss function for softly training number embeddings alongside, say, SQuAD (Rajpurkar et al., 2016).
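Many deterministic constructions satisfy this monotonicity property; the sketch below is one illustrative example (mapping values to points on an arc) and is not the exact DICE recipe, whose details are in Sundararaman et al. (2020).

```python
# One simple deterministic encoder with the DICE-style property that cosine
# similarity between embeddings falls monotonically as |i - j| grows.
# Illustrative construction only; the range [LO, HI] is an assumption.
import numpy as np

LO, HI, D = 0.0, 1000.0, 64

def dice_like(x: float) -> np.ndarray:
    theta = np.pi * (x - LO) / (HI - LO)       # map value to an angle in [0, pi]
    e = np.zeros(D)
    e[0], e[1] = np.cos(theta), np.sin(theta)  # point on a unit circle inside R^D
    return e

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(dice_like(100), dice_like(110)))  # ~0.9995: close numbers, high similarity
print(cos_sim(dice_like(100), dice_like(900)))  # ~-0.81: distant numbers, low similarity
```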
Value Embedding: The most intuitive parameterized encoder for real numbers is one that feeds the scalar magnitude of the number through a shallow neural network. The converse of value embedding is to learn a shallow neural network mapping g : R^d → R. This decoder is simply the probe used for the decoding/numeration task. The idea of projecting number magnitudes into an NLP model that otherwise inputs only lookup embeddings may appear flawed. But Vaswani et al. (2017) have (rather successfully) encoded positional information into transformers using both learned embeddings (similar to Value) and fixed ones (similar to DICE).

Log Value: Wallace et al. (2019) also experiment with a log-scaled value encoder in addition to the one on a linear scale. Zhang et al. (2020) experiment with a log value decoder for measurement estimation, which they call the RGR (regress) method. Log scaling has a neuroscientific inspiration, since human (and animal) understanding of numbers is better modelled by a log-scale representation (Dehaene, 2011).

Log Laplace: In contrast to the point estimate output of the RGR decoder, models can also be used to parameterize a distribution over numbers. Such a formulation is helpful when estimating approximate quantities. Vectors representing some context can be used to parameterize, say, the mean and variance of a Gaussian or Laplace distribution. Berg-Kirkpatrick and Spokoyny (2020) instead transform the space being modeled by parameterizing the location parameter of a Log-Laplace distribution L(X, 1), where X is the context representation of unmasked tokens, in a masked (numerical) language modelling setup. When inferring or decoding a number, they sample a point z ~ L(X, 1) and exponentiate it, such that the output is exp(z).

Flow Laplace: The expressivity of number decoders can be expanded or contracted by merely parameterizing a different distribution. Berg-Kirkpatrick and Spokoyny (2020) propose a more expressive decoder where, instead of the log scale, the model learns its own density mapping. After sampling z ~ L(X, 1), the output is transformed to exp((z - a)/b)^c, where a, b, and c are also parameters emitted by the same model.

MCC: Multi-class classification is another number decoder which outputs a distribution, but a discrete one: over log-scaled bins of numbers, e.g., 1-10, 10-100, and so on (Zhang et al., 2020). The previously described decoders either output a point estimate or a unimodal distribution, thus failing to hedge their predictions for a multimodal ground truth. Given a masked number prediction problem such as We went to the restaurant at [MASK] p.m., MCC is better equipped to estimate two peaks: one around lunch time (say, 1-2 p.m.) and another around dinner (say, 7-9 p.m.).

Discrete Latent Exponent (DExp): This is another potentially multimodal distribution (Berg-Kirkpatrick and Spokoyny, 2020), where the model parameterizes a multinomial distribution for the exponent (similar to MCC) and uses it to sample an exponent e, which then acts as a latent variable for emitting the mean μ of a Gaussian (standard deviation fixed at 0.05). This Gaussian is finally used to sample the output number z ~ N(μ, 0.05).
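A minimal sketch of the DExp sampling procedure as described above; the context vector and projection matrices are random placeholders standing in for what the language model would actually emit, and the number of exponent classes is an assumption.

```python
# Sketch of a DExp-style decoder: a categorical distribution over exponents,
# then a Gaussian (fixed std 0.05) whose mean depends on the sampled exponent
# and the context. Weights and context here are random placeholders.
import torch

D, N_EXP = 32, 11                       # hidden size, exponents 0..10 (illustrative)
ctx = torch.randn(D)                    # stand-in for the masked-token context vector
W_exp = torch.randn(N_EXP, D)           # projects context to exponent logits
W_mu = torch.randn(N_EXP, D)            # one mean-predictor per latent exponent

logits = W_exp @ ctx                    # multinomial over exponents
e = torch.distributions.Categorical(logits=logits).sample()
mu = W_mu[e] @ ctx                      # mean emitted given latent exponent e
z = torch.distributions.Normal(mu, 0.05).sample()
print(int(e), float(z))                 # sampled exponent and output value
```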
GMM: Another attempt to circumvent the unimodal Gaussians or point estimates is to learn a Gaussian mixture model. Spithourakis and Riedel (2018) learn a mixture of K Gaussians by pretraining their means (μ_i) and variances (σ_i^2) over the training corpus with Expectation Maximization algorithms, while the mixing weights π_i are derived from the model. Next, to sample a single number from the GMM probability mass function q(u) = Σ_{i=1}^K π_i N(u; μ_i, σ_i), the authors first sample the precision (number of decimal places) from yet another Gaussian and use that to discretize the probability mass function into equal-sized bins, over which the probabilities are summed. If the sampled precision is, say, 2, then the probability of emitting the number 3.14 is given by ∫_{3.135}^{3.145} q(u) du. This likelihood estimate is used to train a causal language model. Berg-Kirkpatrick and Spokoyny (2020)'s GMM implementation is slightly different: it alters the last inference step by sampling directly from the mixture of Gaussians, as they did with Log Laplace, Flow Laplace, and DExp.

GMM-prototype: Jiang et al. (2020) similarly pretrain (with EM/hard-EM) the means and the variances, but also the mixture weights π_i, of a GMM over the training corpus. They then learn K prototype embeddings e_i corresponding to the K Gaussians. When encoding a new numeral n, its (input) embedding is calculated as E(n) = Σ_{i=1}^K w_i · e_i, where the weights are induced from the GMM:

w_i = P(Z = i | U = n) = π_i N(n; μ_i, σ_i) / Σ_{j=1}^K π_j N(n; μ_j, σ_j)

Thus the difference between GMM and GMM-prototype is that, after fixing the means and standard deviations of the Gaussian mixtures, in GMM the model learns to predict the mixture weights π_i for each individual number prediction, whereas in GMM-prototype the π_i's are frozen and the model learns prototype embeddings e_i's. Note that prototype embeddings are encoder-only. To decode numbers, the authors implement weight-sharing across input and output embeddings, similar to how word vectors are trained (Mikolov et al., 2013), i.e., finding out which of the numerals in the corpus has the closest embedding.

SOM-prototype: GMM-prototype, in effect, merely uses the mixture of Gaussians to infer prototypes and to get the weights w_i. Jiang et al. (2020) tried another variant by identifying prototype numerals with Self-Organizing Maps (Kohonen, 1990) and by defining the weights as w_i = |g(x_i) - g(n)|^{-1}, where x_i is the i-th prototype, n is the number to be encoded, and g is a log-based squashing function.
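The prototype encoding above can be sketched as follows: GMM responsibilities w_i weight a set of prototype embeddings. All parameters below are random placeholders for values that would in practice be pretrained with EM over a corpus and learnt during training.

```python
# Sketch of GMM-prototype encoding:
#   w_i = pi_i N(n; mu_i, sigma_i) / sum_j pi_j N(n; mu_j, sigma_j)
#   E(n) = sum_i w_i * e_i
import numpy as np
from scipy.stats import norm

K, D = 8, 32
rng = np.random.default_rng(0)
mu, sigma = rng.uniform(0, 1000, K), rng.uniform(1, 100, K)   # EM-pretrained in practice
pi = rng.dirichlet(np.ones(K))                                # mixture weights
prototypes = rng.normal(size=(K, D))                          # learnt prototype embeddings

def encode(n: float) -> np.ndarray:
    dens = pi * norm.pdf(n, loc=mu, scale=sigma)   # pi_i * N(n; mu_i, sigma_i)
    w = dens / dens.sum()                          # responsibilities w_i
    return w @ prototypes                          # E(n) = sum_i w_i * e_i

print(encode(42.0).shape)                          # (32,)
```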
2.1.4 Survey of Results

Having organized the landscape of numeracy tasks and methods, we now present some key results for each numeracy task in NLP from previously published experiments over a subset of the described number representations.

Abstract Probes: Word embeddings vastly outperform random embedding baselines on abstract probes such as numeration, magnitude comparison, and sorting (Wallace et al., 2019; Naik et al., 2019). DICE, Value, and Log Value embeddings excel at these probes, which makes intuitive sense given that they explicitly encode the numbers' magnitude, although Value embeddings do not easily extrapolate to larger numbers, possibly due to instability in training. The best number encoders with respect to these probes were found to be DigitCNNs, and character-tokenized models, e.g., ELMo, in general outperform subword ones, e.g., BERT (Wallace et al., 2019).

Arithmetic: GPT-3 (Brown et al., 2020) performs extremely well at zero-shot simple arithmetic, as long as the number of digits in the operands is low. The tokenization scheme could be the cause of limited extrapolation, since language models get better at arithmetic when numbers are tokenized at the digit/character level (Nogueira et al., 2021; Wallace et al., 2019). For arithmetic word problems, state-of-the-art solvers rely on predicting an equation, which is then filled in with specific numeric values from the question (Patel et al., 2021), altogether bypassing the need for encoding numbers into embeddings.

Masked Language Modelling: Zhang et al. (2020) show that BERT pretrained over datasets where numbers are in scientific notation (NumBERT) converges to the same loss as BERT on the masked language modelling objective, and scores nearly the same on GLUE language understanding benchmarks. For (causal) numeric language modelling, Spithourakis and Riedel (2018) show that Gaussian Mixture Models are the best decoders. For (masked) numeric language modelling, Berg-Kirkpatrick and Spokoyny (2020) show that modelling the mantissa in scientific notation may be overkill, since exponent embeddings alone outperform DigitRNN-sci over financial news and scientific articles.

Measurement Estimation: Zhang et al. (2020) train a regression probe to predict measurements of objects over the CLS embeddings of BERT/NumBERT. Given a template-lexicalized sentence such as "the dog is heavy," the model must predict the weight of a typical dog, against ground truth from the Distributions over Quantities dataset (Elazar et al., 2019). They find that NumBERT is a better text encoder than BERT for measurement estimation, the only difference between them being the notation used by the respective pretraining corpora. They also experiment with two number decoders: MCC (multi-class classification) and RGR (regression / Log Value embedding). MCC performs better when trying to predict Distributions over Quantities, perhaps due to the ground truth resembling the predicted Gaussians, but not on VerbPhysics, where the ground truth is less noisy. Lastly, even static word embeddings like GloVe have been shown to contain enough knowledge of measurement estimates to contrast two objects, e.g., classifying whether a car is bigger/heavier/faster than a ball (Goel et al., 2019).

Exact Facts: BERT and RoBERTa capture limited numerical commonsense, evident over NumerSense (Lin et al., 2020) sentences such as a tricycle has [MASK] wheels, with the answer choices limited to the integers 0-10. Results can be further improved by finetuning over a Wikipedia-extracted dataset of numeric information. Mishra et al. (2020) find commonsense question answering to be one of the hardest among their Numbergame challenge, using the NumNetv2 model (Ran et al., 2019), which is commonly used for DROP question answering. Both of these experiments evaluate on exact match metrics, hence it remains to be seen if representing approximate magnitudes yields benefit in modelling numeric facts.

2.1.5 Recommendations

Based on the above results, we now synthesize key insights into a set of directed takeaways to guide practitioners' design of number representations:
Character level tokenization outperforms subword level (Nogueira et al., 2021; Wallace et al., 2019; Geva et al., 2020). Pooled representations (DigitRNN, DigitCNN) lack a controlled study with unpooled ones (NumBERT, GenBERT) which makes it hard to proclaim a winner among the two. Rule of thumb for real-based methods? Log scale is preferred over linear scale (Zhang et al., 2020; Jiang et al., 2020; Wallace et al., 2019; Berg-Kirkpatrick and Spokoyny, 2020), which makes intuitive sense but lacks as rigorous a study as has been undertaken in the cognitive science community (Feigenson et al., 2004). Regarding discretization, Zhang et al. (2020) show that binning (dense cross entropy loss) works better than continuous value prediction (MAE loss) on datasets where ground truth distributions are available. Lastly, modeling continuous predictions is notoriously hard for large ranges (Wallace et al., 2019) but Spithourakis and Riedel (2018) offer a way of binning such distributions by picking a precision level. 20 Encoding vs Decoding numbers? In our simplified discussions above, we avoid differentiating between methods for encoding and decoding numbers. Value Embedding, for instance, can be used to encode numbers (projecting scalars onto vector space) as well as to decode numbers (collapsing a vector into a scalar). On the other hand, manually-designed encoders like DICE are not easily reversible into decoding methods. Even with reversible methods, the encoders and decoders must usually be independently parameterized, unlike the input and output word embeddings which often share weights (Press and Wolf, 2016). Prototype embeddings by Jiang et al. (2020) are an exception, which share input/output embeddings for a fixed vocabulary of numbers. Can we mix-and-match multiple methods? Given the wide range of number representations, an obvious next step is to try an ensemble of embeddings. Berg-Kirkpatrick and Spokoyny (2020) show that for encoding numbers, exponent embeddings added to DigitRNN (scientific notation) embeddings barely outperforms the exponent embeddings alone. Similar experiments with a mix of real and string methods are yet to be seen. Which methods for which tasks? Based on our taxonomy of tasks in Table 2.1, abstract tasks are good early probes for the grounded ones, e.g., finetuning GenBERT (Geva et al., 2020) on simple arithmetic helps it do well on downstream question answering, and the high scores of DICE (Sundararaman et al., 2020) on numeration and magnitude comparison are an indicator of similar boosts on (numeric) language modelling. With respect to granularity, real-based methods work well for approximate tasks such as measurement estimation and language modeling (Zhang et al., 2020; Berg-Kirkpatrick and Spokoyny, 2020) but not for exact tasks like arithmetic word problems or commonsense. DigitRNNs are broad-purpose number encoders, whereas distribution modeling methods like DExp are effective at decoding numbers. 21 2.2 Tokenization Almost all natural language processing (NLP) begins with tokenization (Mielke et al., 2021a). Sequences of characters are (mostly deterministically) segmented into discrete tokens, each of which has a lookup embedding in an enormous vocabulary matrix. Statistical NLP methods, similar to other forms of machine learning at the time, relied on feature extraction from these tokens, in the form of n-gram occurrences or part-of-speech tags or other representations of syntax. 
All of these pipelines have over time been replaced with end-to-end learning using recurrent neural networks (RNNs) or transformers, however the tokenization schemes remain static, deterministic, and manually engineered. Below are the most commonly used tokenization strategies in NLP alongside some lesser known alternatives. We refer the interested reader to Mielke et al. (2021b) for a deeper survey on tokenization in NLP. 2.2.1 Byte Pair Encoding The modern workhorse of tokenization in NLP is a heuristic atop byte representations called Byte Pair Encoding. Starting from a base of 256 bytes and a training corpus, the most frequently occurring byte pairs are incrementally merged, e.g., t+h →th, th+e→the, and so on. Nearly all large language models today (Touvron et al., 2023a,b; Groeneveld et al., 2024; Jiang et al., 2023) rely on Byte Pair Encoding as their base tokenizer, with different number of merges. GPT3 (Brown et al., 2020) uses a vocabulary of 50,257 BPE tokens (50,000 merges and a special token) while GPT4 (OpenAI, 2023) pushes it further to 100,000 merges. Some recent 22 work has challenged the subword tokenization schemes. Table 8.1 highlights the different kinds of tokenizations existing in prior work and positions our work uniquely among them. 2.2.1.1 Character/Byte-level Most natural language text on the internet is encoded using UTF-8 byte encodings, therefore a byte-level representation of text makes for a convenient option. Their vocabulary size is restricted to a mere 256 possible bytes, and most Latin languages require a single byte per character. ByT5 (Xue et al., 2022), CANINE (Clark et al., 2022), and SubChar (Si et al., 2021) propose using very small fixed-length units such as characters, bytes, or glyph strokes instead of dynamic-length subwords or words. This often comes at the expense of larger sequence lengths and more compute requirements, especially for a transformer architecture which typically has a complexity of O (𝑛 2 ) in number of input tokens. 2.2.1.2 Beyond word level CodeBPE (Chirkova and Troshin, 2022) and Multi Word Expressions (Kumar and Thawani, 2022; Zaninello and Birch, 2020; Rikters and Bojar, 2017) show promise in yet larger tokens that cross word boundaries, e.g., a vocabulary with single tokens for the strings “for i in range” or “New York City” respectively. 2.2.1.3 Visual segmentation Yet another line of work (Rust et al., 2022; Salesky et al., 2021) renders text as images before feeding them to CNNs, doing away with tokenization altogether and showing gains in robustness to spelling or printing errors. 23 2.2.1.4 Learnt subword segmentation Finally, some methods (Mofijul Islam et al., 2022; Kaushal and Mahowald, 2022; Pinter et al., 2021; Tay et al., 2021; Provilkov et al., 2020; Wang et al., 2021) parameterize the process of tokenization by pooling character n-grams or randomly choosing one of the many ways to segment a given word. A recent preprint on machine translation by Sreedhar et al. (2022) proposes a method called WSF, perhaps closest to ours, except that they only use the word boundary fusion at encoder stage. Our independent analysis focuses on language modeling instead and also generates text in parallel using end-to-end attention based tokenization. 2.2.1.5 What is missing We note however that all of the above, while challenging the static subword tokenization schemes, are nevertheless merely segmenting text on the surface form. 24 Method Citation Compress? Generate? Learnt? Word level? GPT Radford et al. 
(2018) Lookup Yes No Yes ByT5 Xue et al. (2022) None Yes No No MANTa Godey et al. (2022) Segment Yes Yes No RetVec Bursztein et al. (2023) Conv. No Yes Yes FastText Bojanowski et al. (2017) Conv. No Yes Yes ELMo Peters et al. (2018b) Conv. No Yes Yes CharBERT El Boukkouri et al. (2020) Conv. No Yes Yes CharFormer Tay et al. (2021) Conv. No Yes No LOBEF-nCF Sreedhar et al. (2022) None Yes Yes No LOBEF-WSF Sreedhar et al. (2022) None Yes Yes Yes CANINE Clark et al. (2022) Conv. Yes Yes No MegaByte Yu et al. (2023) Dense Yes Yes No Ours Attn. Yes Yes Yes Table 2.3: Literature Review of existing tokenization methods along several dimensions. Compress? Is the input string chunked into bigger units? Generate? Whether or not the model can generate new unseen tokens? Learnt? Is the tokenization learnt end-to-end with other parameters? Word Boundary? Is the word boundary considered or treated as just another token? Conv: Convolution. Attn: Attention. 25 Tokenizer Citation Architecture Vocab Size Parameters Train Data FastText Bojanowski et al. (2017) No No No No ELMo Peters et al. (2018b) No No No No CharBERT El Boukkouri et al. (2020) Yes No No Yes CharFormer Tay et al. (2021) No No Yes Yes LOBEF Sreedhar et al. (2022) No No No Yes CANINE Clark et al. (2022) No No No Yes ByT5 Xue et al. (2022) No No Yes Yes MegaByte Yu et al. (2023) No No No Yes RetVec Bursztein et al. (2023) No No No Yes eByte/eChar Thawani et al. (2023a) No No Yes Yes Factorizer Samuel and Øvrelid (2023) Yes Yes Yes Yes Table 2.4: Literature Review of alternative tokenizers and what they control for. We work with Factorizer, the only tokenizer that controls for all dimensions and makes it possible to compare directly against a subword vocabulary. 26 Chapter 3 Dataset for Numeric Language Modeling Numbers are an integral part of text. To understand a simple sentence like I woke up at 11, we need not just literacy but also numeracy. We must decode the string 11 to the quantity 11 and infer 11 to denote a time of the day, probably 11 a.m. We need commonsense to reason that 11 a.m. is quite late in the morning. This interpretation of 11 is strongly contextual, as I earn $11 per month evokes different units and value expectations. Note how the semantics remains the same for both sentences if 11 was replaced by 10, i.e., the context is tolerant to some variability. Numbers are everywhere. Reasoning with quantities and counts is crucial to understanding the world. Evolutionary learning has given numerical cognition skills to several animals, including human beings (Dehaene, 2011). Our ancient ancestors furthered numeracy by developing multiple number systems, similar to but independent from the evolution of languages. Numeracy is an essential skill for language understanding, since numbers are often interspersed in text: the 6 million pages in English Wikipedia have over 150 million numbers. Numbers are diverse. The distribution of regex-extracted numbers in Wikipedia alongside their reported frequencies, both on the log scale, is shown in Figure 3.1. Some trends clearly stand out, for example Benford’s Law which states that numbers with smaller leading digits (1, 2, 27 Figure 3.1: Distribution of numbers in Wikipedia. . . .) occur more frequently than larger ones (. . . , 8, 9). Similarly, there is an exceptionally high occurrence in numbers that correspond to recent years since 2000. We will show in the upcoming chapters that such distributional patterns help some number representations do better than others. 
The distinct parallel line-like patterns observed in the figure correspond to precision, i.e., the number of digits after the decimal. The key takeaway is that numbers have diverse and interesting distributional patterns for NLP representations to exploit. Numbers are neglected. In NLP, however, numbers are either filtered out explicitly during preprocessing (Graff et al., 2003), or treated the same as words, often collapsing them into an UNK token. Subword tokenization approaches like BPE (Sennrich et al., 2016b) and WordPiece (Wu et al., 2016a) instead retain numbers, but split them into arbitrary tokens, for example 1234 might be split into two tokens as 12-34 or 123-4 or 1-234. Numbers are challenging. Recent work has shown that these are suboptimal number representations (Wallace et al., 2019; Zhang et al., 2020). On the DROP Question Answering 28 benchmark, BERT performs five times worse when the answer is a number instead of a span of text (Dua et al., 2019). Relatively simple strategies like switching from subword to char-level tokenization (Geva et al., 2020), or from decimal to scientific notation (Zhang et al., 2020) already boost performance. Such results warrant a deeper study into the best number representations. Numbers are important. Given the ubiquity of numbers and their fundamental differences with words, enabling NLP systems to both read (encode) and write (decode) numbers effectively is beneficial for domains like scientific articles (Spithourakis and Riedel, 2018) and financial documents (Chen et al., 2019; Jiang et al., 2020). Number understanding is also useful to detect sarcasm (Dubey et al., 2019) and to model dialogues involving price negotiations (Chawla et al., 2020). Motivated by the ubiquity, diversity, neglect, difficulty, and importance of numbers in NLP, this work takes a step towards better contextual numeracy for language models. Existing benchmarks for numeric language modelling have been extracted automatically using regular expressions (Spithourakis and Riedel, 2018; Berg-Kirkpatrick and Spokoyny, 2020; Chen et al., 2019), and hence have no mechanism to filter out nominal numbers, such as zip codes, phone numbers, or proper nouns (e.g., “Boeing 747”). To allow for a more meaningful comparison, we propose Wiki-Convert, a novel benchmark for numeric language modeling extracted from English Wikipedia. Wiki-Convert consists of a curated set of sentences where the numbers are not extracted by regex matching, but annotated by humans, i.e., the editors who wrote the Wikipedia article in the first place. Specifically, we make use of Convert,1 a template that contributors have used over 3.2 million times in Wikipedia to seamlessly convert between different units of measurement. 1https:wikipedia.org/wiki/Help:Convert 29 For example, {{Convert|50|𝑚𝑖|𝑘𝑚}} is parsed in Wikipedia as 50 miles (80 kilometers). Concretely, we extract over 3 million Convert occurrences in over 1 million sentences from the May 2020 dump of English Wikipedia. We preprocess them, retaining only the 30 most frequent units (e.g., miles, acres, pounds), and filter out sentences with multiple number annotations. The end result is a dataset of over 900,000 sentences along with an annotated <number-unit> tuple. We find Wiki-Convert to be a useful benchmark not only for numeric language modelling but also for measurement estimation tasks (Zhang et al., 2020; Zhou et al., 2020). Example Wiki-Convert annotations are shown in Table 3.1. 
Sentence | Number | Unit
minutes into the descent burn, and [NUM] [UNIT] above the surface of the Moon ... | 6000 | feet
Bethel average around [NUM] [UNIT] of precipitation. | 100 | inches
U-559 had a displacement of [NUM] [UNIT] while submerged | 871.0 | tonne
... temperature ranges from . . . in January to [NUM] [UNIT] in July | 73.9 | °F
... the rocket made the slow [NUM] [UNIT] journey to the launch pad | 3 | miles
Table 3.1: Example sentences from Wiki-Convert along with annotated numbers and units.

Owing to its expansive scope, high-precision annotations, and challenging semantics, Wiki-Convert has been cited, used, and extended heavily in the research community (Spokoyny et al., 2022a; Thawani et al., 2021a; Spokoyny et al., 2022b; Thawani et al., 2023a; Huang et al., 2023). The dataset is free to download from HuggingFace,2 PapersWithCode,3 and GitHub.4

2 https://huggingface.co/datasets/usc-isi/WikiConvert
3 https://paperswithcode.com/dataset/wikiconvert
4 https://github.com/avi-jit/numeracy-literacy

Chapter 4
Effects of Tokenization on Number Estimation in Text

The standard practice in the language modeling community is to process numbers in exactly the same manner as words. This second-class treatment of numbers leads to their inaccurate representation and, therefore, limited numerical understanding in large-scale language models (LMs). To illustrate, a number like $799 is subword tokenized (Sennrich et al., 2016b) as 79 and ##9. Such a tokenization method, by construction, prevents accurately modeling the relationship of this number with other numbers on the number line, say $800, as the surface forms share no common tokens. Many alternatives have been proposed to capture the scalar magnitude of numbers (Thawani et al., 2021b). All number decoders proposed to capture the magnitude of numbers fall into one of the following categories, corresponding to modifications to 1) notation (e.g., scientific vs. decimal), 2) vocabulary (e.g., introducing new tokens that denote all numbers within a specified range), or 3) architecture (e.g., directly regressing to a number). Table 4.1 shows the various approaches on the task of masked number prediction.

We show in this section that applying the tokenization-level changes leads to near state-of-the-art performance, requiring no additional pre-training or architectural changes. This is a surprising yet useful finding, which can substantially speed up adoption of numeracy into any given language model. Any blackbox LM can be made numerate by simply tokenizing numbers on the number line.

Degree of Change | Expected predictions for: iPhone [MASK] costs $[MASK].
(default) | iPhone 13 costs $ 79 ##9 .
Notation | iPhone 13 costs $ 7 . 99 e 2 .
Vocabulary | iPhone 10-100 costs $ 100-1000 .
Architecture | iPhone 13.0000 costs $ 799.0000 .
Table 4.1: Multiple approaches to masked number prediction or number decoding. Color coding in the original: tokens in the vocabulary of BERT (Devlin et al., 2019), new tokens, and continuous-valued predictions.

We further evaluate the number representation schemes on their ability to generalize to downstream tasks – in this case, numerical fact estimation in the context of solving Fermi problems (Kalyan et al., 2021a). We find trends similar to the task of masked number prediction, demonstrating the utility of the simple yet effective tokenization scheme in the decoding setting.

4.1 Methods

The NLP community has recently proposed several ways of improving the numeracy of language models, including architectural and notation interventions.
We focus on the task of approximately decoding numbers in MNP setting, as opposed to exact numeracy required in, say, arithmetic word problems. In this subsection, we introduce existing classes of number decoders and discuss the trade-offs involved in using them. 32 Default. The default way that language models decode numbers is the same way that they predict words, one subword at a time. For example, the number 329 could be decoded as two individual tokens 3 and ##29. Change of Notation. Here, the numbers are represented in an alternate notation – e.g. scientific notation as opposed to decimal notation. Note that this approach does not require changing any of the other components of language modeling. In this work, we consider the following variations: Scientific: Using scientific notation in lieu of the usual decimal notation was first proposed by Zhang et al. (2020). In this work, we closely follow their version with minor implementation level changes. Importantly, note that following the notation change, the tokenizer nevertheless splits it into subwords as before. Digits: Here, the number is split into its constituent digits or characters, e.g., 329 becomes 3 2 9. This approach offers a consistent decomposition of numbers into digits, as opposed to the arbitrary tokens from subword segmentation and has been proven effective on simple numeric probes as well as arithmetic word problems Geva et al. (2020). Change of Vocabulary Unlike words, the notion of distance or similarity is more obviously defined for numbers in terms of their separation on the number line, a cognitive tool that human beings are known to intuitively use to process numeracy (Dehaene, 2011). This forms the basis of a change of vocabulary: numbers within a specified range are collapsed into a single token – at the cost of precise representation of numbers. This approach to tokenizing the number space is analogous to stemming of words. Stemming is a simple technique to collapse low frequency words to their lemma in order to curtail the vocabulary size, e.g., playing, player and played all collapse into the token for play. Similarly, exponent embeddings collapse multiple numbers into a single token covering a range of numbers. While this approach has already been used in the 33 context of encoding numbers (Berg-Kirkpatrick and Spokoyny, 2020; Thawani et al., 2021a), our work is the first to use and study this approach when outputting or decoding numbers. Change in Architecture. Finally, several recent methods have modified the underlying language model to emit continuous values when predicting numbers. At their core, they operate by regressing to the desired number conditioned on the language context. See Berg-Kirkpatrick and Spokoyny (2020) for a thorough comparison within this class of methods. We directly compare against their best variant: Discrete Latent Exponents, which first models the exponent part of a number as a multinomial, and then uses it to parameterize a truncated log normal distribution to sample the mantissa as a continuous value. 4.2 Implementation Details The key contribution of this work is to highlight the possibility of achieving near state-of-the-art results from Berg-Kirkpatrick and Spokoyny (2020) with a much simpler method. Thus, we used the same hyperparameters and extend their code for most of our experiments1 . Please refer to Section 3 in their paper for dataset details. The base language model is 12-layer BERT-base implemented using HuggingFace transformers v2.3.0 (Wolf et al., 2020). 
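Since the vocabulary-change approach needs nothing more than new tokens in an off-the-shelf tokenizer, a minimal sketch of the idea is given below. The token strings, the binning rule, and the helper function are illustrative assumptions; the actual experiments extend the Berg-Kirkpatrick and Spokoyny codebase rather than this snippet.

# Minimal sketch of "tokenizing on the number line": add one token per
# order-of-magnitude bin to a stock BERT and resize its embeddings.
# Token strings and the binning helper are illustrative assumptions.
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One token per decade between 10^0 and 10^16, e.g. "[NUM_1e2_1e3]".
range_tokens = [f"[NUM_1e{e}_1e{e + 1}]" for e in range(16)]
tokenizer.add_tokens(range_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized, then finetuned

def bin_token(value: float) -> str:
    # Map a number in [1, 1e16) to its order-of-magnitude token.
    exponent = max(0, min(15, len(str(int(value))) - 1))
    return f"[NUM_1e{exponent}_1e{exponent + 1}]"

print(bin_token(799))                                            # [NUM_1e2_1e3]
print(tokenizer.tokenize(f"iPhone 13 costs $ {bin_token(799)} ."))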
We train all models with stochastic gradient descent using a batch size of 32 for 10 epochs. We use early stopping with a patience of three on the validation loss. For pretrained BERT encoder experiments, we use two learning rates, 3e-5 and 1e-2, for all pretrained parameters and newly added parameters respectively. Each of our experiments took a few hours on an NVIDIA Quadro RTX 8000 GPU (one per experiment). We report results on the same random seed across models. We were able to reproduce the DExp result scores exactly up to 1 decimal place. Further, note that we only compare number decoders and not the encoders – therefore, when numbers are present in the input, standard encoding schemes are used. For approaches with changes to vocabulary and architecture, we follow Berg-Kirkpatrick and Spokoyny (2020) and use exponent embeddings to encode numbers (with no shared parameters with the decoder's tokens), and for approaches with notation changes, we use subword tokenization. With scientific notation, a previous approach, NumBERT (Zhang et al., 2020), denotes 329 as 329 [EXP] 2. However, we find that representing the same instead as 3x29, where 'x' is the common English letter, works better in practice.

1 https://github.com/dspoka/mnm

4.3 Experiments and Results

We evaluate different number decoders on the task of masked number prediction (MNP). Before analyzing their performance, we first describe the datasets, models, and metrics used.

Dataset and Metrics. We follow Berg-Kirkpatrick and Spokoyny (2020) to finetune and evaluate our models on four datasets – Financial News Articles (FinNews), its subset containing mostly price-based numbers (FinNews-$), Scientific Articles (Sci), and number-annotated sentences from Wikipedia (Wiki-Convert); all numbers in these datasets lie between 1 and 10^16. We evaluate using two metrics: a) Exponent Accuracy (E-Acc), which checks whether the predicted answer is of the same order of magnitude as the ground truth, and b) Log Mean Absolute Error (LMAE).

                      FinNews              FinNews-$            Sci-Docs
Metrics               E-Acc↑   LogMAE↓     E-Acc↑   LogMAE↓     E-Acc↑   LogMAE↓
Baselines
Train-Mean            1.02     7.69        6.02     4.68        0.01     8.81
Train-Median          5.52     1.88        10.58    2.66        49.52    0.83
Train-Mode            24.23    2.02        8.13     6.30        49.52    1.00
Subword-Pad8          63.56    0.68        29.05    1.36        68.02    0.68
Notation-change
Digit-Pad17           52.23    0.93        33.04    1.37        55.12    0.91
Scientific-Pad8       52.53    0.84        NA       NA          71.14    0.66
Vocabulary-change
DExp-AM               74.40    0.65        57.14    0.93        81.16    0.51
DExp-GM               73.70    0.60        56.99    0.92        81.32    0.44
Architecture-change (Berg-Kirkpatrick and Spokoyny, 2020)
DExp                  74.56    0.50        57.50    0.89        81.17    0.39
Table 4.2: Main Results: Order-of-magnitude accuracy (E-Acc) and Log Mean Absolute Error (LMAE) over the test set of three datasets, contrasting the three degrees of freedom for improving the numeracy of language models. NA denotes subword models which were unable to emit valid numbers for at least 50% of the examples. Best and second-best results are bold-faced and underlined respectively in the original.

Wiki-Convert          E-Acc↑   LogMAE↓
Baselines
Train-mean            0.0007   5.7310
Train-median          0.3426   0.9077
Train-mode            0.2695   1.7687
Vocabulary-change
DExp-AM               0.5470   0.6576
DExp-GM               0.5466   0.6285
Architecture-change
DExp                  0.5454   0.6087
Table 4.3: Results on our new dataset: Wiki-Convert.

Baselines. Our primary baseline is the standard approach of subword tokenization. We require each number prediction to be 8 tokens long, with appropriate padding, to be able to fairly represent all numbers in our range.
Additionally, we evaluate on three trivial baselines that make a constant prediction corresponding to the mean, median, and mode of all numbers in the training set. Models. We compare against both notation-level changes i.e. scientific and digit, with a padding of 8 and 17 respectively. Among the approaches the introduce architectural changes, we compare against the state-of-the-art discrete exponent model DExp (Berg-Kirkpatrick and Spokoyny, 2020). Finally, we compare against two variations that introduce vocabulary level changes – both discretize the number line with logarithmic-ally sized bins (with base 10). The two variants differ in how the mantissa is chosen – either the arithmetic mean (5) or the geometric mean (√ 10), named DExp-AM and DExp-GM, respectively. Please see Appendix 4.2 for more implementation details. 37 Figure 4.1: Histogram of mantissas for the 58K sentences in FinNews dev set (true) and corresponding predictions by DExp (pred). See Section 4.3.1 for details. 4.3.1 Main Results Table 4.2 summarizes our main results. We find that the straightforward, change of notation approaches are inferior to the subword baseline. This is in contrast to prior work on extrapolating the arithmetic abilities of language models by simple notation changes (Nogueira et al., 2021; Geva et al., 2020). This result suggests that simple pre-processing changes like changes of notation are not sufficient for contextual understanding of numbers for language modeling. Next, we find that while DExp model is the best performing method, approaches that instead make changes to the vocabulary (DExp-AM/GM) are a close second. Notably, over 90% of the gain in E-Acc, and up to 94% of the drop in LogMAE, from subword to DExp models for FinNews corpus, is achievable without modelling the mantissa at all! We perform the same experiments on our newly introduced dataset Wiki-Convert and find similar trends in Table 4.3. To study the cause behind this finding, we dig deeper into the only component that differentiates our proposed DExp-AM/GM models from the state-of-the-art DExp: mantissas. We plot the mantissas from DExp’s predictions against the ground truth (FinNews dev set) in Figure 4.1. 38 FY2018 Earnings per share view $ Daniels maintains Cohen paid her $130000 via Input [MASK] , revenue view . . . essential consultants to hush up a [MASK] sexual encounter with Trump. Ans. 1.63 2006 Sub 1000000 1 DExp 2.695 2792.66 Ours 1-10 1k-10k Table 4.4: Example predictions from FinNews dev set. Ours (DExp-GM) and DExp estimate numbers in the same order of magnitude as ground truth (Ans.) but the estimate of the subword baseline (Sub) is far off. We find that in the naturally occurring datasets, the leading digit of numbers is likely to be small (Benford’s Law) and the mantissa peaks around 2, owing to the frequent mentions of years (2000 − 2022) from our current millennium (Recency Bias). This rather simple distribution of numbers in the real world helps our static DExp-AM/GM models perform at par with the state-of-the-art DExp without making any architectural changes to the underlying language model. Finally, Table 4.4 shows some representative examples from FinNews dataset where the Subword baseline’s estimate is far off from the ground truth, whereas predictions of both DExp and DExp-GM are within the correct order-of-magnitude. 
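As a rough illustration of how lightweight the DExp-AM/GM decoders above are, the sketch below decodes a predicted order-of-magnitude bin with a fixed mantissa and scores it with the two metrics used in this chapter. The E-Acc and LogMAE implementations follow their verbal definitions (same order of magnitude; absolute error in log10 space) and may differ in edge cases from the actual evaluation code.

# Sketch (assumed details) of the vocabulary-change decoders DExp-AM/GM:
# the model only picks an order-of-magnitude bin, and the mantissa is a
# fixed constant per bin.
import math

def decode_bin(exponent: int, variant: str = "GM") -> float:
    # Bin [10^e, 10^(e+1)) is decoded to mantissa * 10^e.
    mantissa = 5.0 if variant == "AM" else math.sqrt(10)  # arithmetic vs. geometric mean
    return mantissa * 10 ** exponent

def e_acc(pred: float, truth: float) -> bool:
    # Exponent accuracy: same order of magnitude as the ground truth.
    return math.floor(math.log10(pred)) == math.floor(math.log10(truth))

def log_mae(pred: float, truth: float) -> float:
    return abs(math.log10(pred) - math.log10(truth))

truth = 799.0               # e.g. the masked price "$799"
pred = decode_bin(2, "GM")  # the model predicted the 10^2 to 10^3 bin
print(round(pred, 1), e_acc(pred, truth), round(log_mae(pred, truth), 3))
# -> 316.2 True 0.403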
4.3.2 Downstream zero-shot transfer Given the trends observed in masked number prediction, we are interested in analyzing the utility of these models on a downstream number prediction task. For this purpose, we evaluate on numerical fact estimation. We pick the Fermi Problems dataset (Kalyan et al., 2021b), which 39 Fermi-Real FinNews FinNews-$ Sci 510 egs. E-Acc ↑ LogMAE↓ E-Acc↑ LogMAE↓ E-Acc↑ LogMAE↓ Sub-Pad8 26.11 2.38 16.07 3.17 25.89 2.84 Dig-Pad17 18.79 2.58 NA NA 23.27 2.87 Sci-Pad8 24.78 2.93 NA NA 20.09 2.75 DExp-AM 32.21 2.19 24.38 2.42 27.29 2.42 DExp 32.21 2.13 25.06 2.51 28.19 2.40 Fermi-Syn FinNews FinNews-$ Sci 3437 egs. E-Acc ↑ LogMAE↓ E-Acc↑ LogMAE↓ E-Acc↑ LogMAE↓ Sub-Pad8 28.72 2.89 19.12 3.25 38.93 2.83 Dig-Pad17 21.66 2.93 NA NA 40.73 2.87 Sci-Pad8 25.75 3.06 NA NA 27.05 2.76 DExp-AM 39.08 2.61 40.85 2.42 46.86 2.52 DExp 39.22 2.44 41.36 2.44 47.60 2.48 Table 4.5: Downstream performance of our main methods over fact estimation for solving Fermi Problems. NA denotes subword models which were unable to emit valid numbers for at least 50% of the examples. consists of challenging estimation problems such as How many tennis balls fit in a school bus?. Solving such questions require estimating numeric facts e.g. volume of tennis ball and length of bus. We evaluate each of our models on such annotated facts provided as part of both the real and synthetic datasets part of the fermi problem dataset. The task setup is of masked number prediction as before, e.g., “the size of a tennis ball is [MASK] cubic centimeters." We report E-Acc and Log MAE as before, in Table 4.5. We find similar trends as in §4.3.1 i.e. change of notation is insufficient while vocabulary-change approaches are closely behind approaches that make architectural changes – highlighting that most of the gains could be retained by simply tokenizing in number space. 40 4.4 Neuron Probing In this subsection, we further probe how numeracy is stored in the feed forward layers of language models. Previous work along these lines (Geva et al., 2021) have shown promise in interpreting the knowledge stored in language models by finding individual neurons in feed forward layers that are triggered by specific patterns of input. We apply this analysis to find some such neurons, if any, which can effectively and efficiently capture the magnitude of a masked number. Figure 4.2 shows the Precision-Recall curves for the state-of-the-art DExp model on the task of predicting masked numbers has an exponent of 3, i.e. it is between 1000 and 10,000. We say a neuron has been triggered if it is among the top 50 activated ones (out of 3072) in that layer for the input mask token. Recall is then defined as the fraction of times when this neuron was triggered for all masked numbers with an exponent of 3. Precision is defined as the fraction of times when the exponent was 3 for all the times that the specific neuron was triggered. We find that some individual neurons, such as the 650th neuron in the 10th layer of finetuned DExp has a very high precision and recall. It alone can predict whether the order of magnitude is 3, with an F1 score of above 0.7. This analysis shows promise in further interpreting the results of number representations in language models and possibly even causing interventions to update its beliefs (Dai et al., 2021). 4.5 Discussion Subword tokenization, the standard approach to representing numbers leads to inaccurate numerical understanding. 
In this work, we analyze number representation approaches that make 41 Figure 4.2: Precision Recall curve 42 notational (e.g. scientific vs. decimal), vocabulary (i.e. tokenizing on the number line), and architectural changes (i.e. regressing to the number). We find that tokenization on the number line achieves near or better than state-of-the-art results while requiring minimal intervention to the language model. It will allow language models to conveniently improve their numeracy, including cases where users may not have access to the model’s architecture and are only provided a typical finetuning regime with small changes to the tokenizer’s vocabulary. Finally, we find similar trends in the challenging setting of numerical fact estimation for solving Fermi Problems – indicating that vocabulary-change is sufficient to represent approximate numbers effectively with minimal effort. 43 Chapter 5 Effects of Tokenization-enhanced Numeracy on Literacy Numbers account for 6.15% of all unique tokens in English Wikipedia (Jiang et al., 2020), yet NLP systems have traditionally either removed numbers during preprocessing or replaced them with a single uninformative UNK token. Recent models such as BERT retain them but learn individual token embeddings for hundreds of numbers. Moreover, subword tokenization approaches end up segmenting numbers into possibly suboptimal splits, e.g., 4500 is seen as (4, 500) or (45, 00) depending on the specific tokenizer used. The human brain, in contrast, automatically maps numbers to their approximate magnitude on the number line (Dehaene, 2011). NLP systems that fail to account for the scalar values that numbers denote may correspondingly lack in comprehension. Recent work has empirically demonstrated the inefficacy of existing NLP methods in numeric reasoning tasks (Wallace et al., 2019). Alternative number representations have been proposed, such as projecting the number’s magnitude into a vector space (Sundararaman et al., 2020) or switching to a scientific notation (Zhang et al., 2020; Berg-Kirkpatrick and Spokoyny, 2020). For a representative summary, please refer to the previous chapter 4 on improving numeracy by enhanced tokenization in language models. 44 BERT Exp The [mask] weighs 100 lbs. statue bomb The [mask] weighs 10000 lbs. statue car Table 5.1: Numerate language models perform better at masked word prediction. BERT: Default BERT baseline. Exp: BERT with exponent embeddings (§5.1). Figure 5.1: Different number encoders as described in Section 5.1. Notes: † 2.517 is log10 329. ‡ 329 collapses to the 130𝑡ℎ bin out of 200 log-scaled bins within our range of [1𝑒 − 4, 1𝑒 + 6]. We observe that this line of work goes from literacy to numeracy, i.e., helping language models gain numerate skills such as simple arithmetic (Geva et al., 2020), measurement estimation (Zhang et al., 2020), and masked number prediction (Berg-Kirkpatrick and Spokoyny, 2020). This chapter, on the other hand, addresses the converse question: Do alternative number representations enhance the ability of language models to understand/predict words? We investigate this question through experiments with several representative number encoders, proposed in prior work. We develop and release Wiki-Convert, a large, novel dataset of numberannotated sentences, which helps us disentangle the nominal occurrences of numbers. Our experiments show the positive impact of numeracy on a language model’s literacy, as illustrated 45 in Table 5.1. 
The default BERT model is unable to update its predictions for an object whose weight is switched from 100 to 10,000. However, our numeracy-aware method is able to predict that 100 lbs is a typical weight of a bomb, while 10,000 lbs is that of a car, due to its understanding of magnitudes and their association with words. We also find this improved literacy in contexts without numbers. 5.1 Methods Our hypothesis is that language models will benefit from specialized encoders which explicitly make use of the number’s magnitude. In line with both cognitive science research (Dehaene, 2011) as well as recent work on numeric representations within NLP, we propose that numbers and words be encoded differently by a language model. Words can continue to be subword-tokenized and encoded via lookup embeddings, but number encoding should consider the magnitude. We consider three representative methods from prior work which make use of a number’s magnitude to encode it in vector space, as well as three baselines (marked with *), each of which is depicted pictorially in Figure 5.1. 1. Value embeddings (Wallace et al., 2019) project the scalar magnitude of the number to be encoded into a vector space of same dimensionality as the lookup word embeddings. We use a 1-hidden layer feed forward layer as the projection network, with a configurable number of hidden neurons. 2. LogValue is the log-scaled extension of Value, wherein the projection of the scalars is preceded by a log(·) function (Wallace et al., 2019). 46 3. Exp or Exponent embeddings are lookup matrices for the exponent part of a scientific number notation, e.g., 2 in 3.29𝑒2 (Berg-Kirkpatrick and Spokoyny, 2020). Note how this method collapses numbers into equally spaced bins on the log scale. Although the authors used a specific implementation based on decimal scientific notation, we generalize this method to an arbitrary number of bins. Note that this is the same as Vocabulary-change method described in the previous chapter. 4. Default* is the usual way that BERT (Devlin et al., 2019) encode numbers: subword tokenization (Schuster and Nakajima, 2012) followed by lookup embeddings. 5. None* removes all numbers from the sentence during preprocessing. This is analogous to the baseline implementation in Berg-Kirkpatrick and Spokoyny (2020), except they mask the numbers instead of filtering them out. 6. Num* method learns a single lookup embedding for all numbers, reflecting how traditional NLP replaced any number occurrence with a single token (Graff et al., 2003), such as UNK or NUM. This method can be seen as exponent embeddings with a single bin, into which all numbers are collapsed. 5.2 Experiments We operationalize our research question by fine-tuning the same pretrained masked language model (BERT-base-uncased) with each of the six encoding methods (Section 5.1) on the task of masked word prediction. Thus when we say numeracy, we refer to the ability of the three number-specific encoders to take into account a number’s magnitude and not its surface form. And when we say literacy, we refer to the masked word prediction ability of a language model, assuming it to be a valid proxy for downstream performance on other literacy tasks. 47 The methods encode annotated numbers into 768-dimensional vectors. Words, as well as numbers which are not annotated, are encoded by the usual subword tokenization followed by lookup embeddings. Value and LogValue methods each have a single hidden layer with 200 neurons, and exponent embeddings have 200 bins. 
We manually tuned the hyperparameter N (for Value, LogValue, and Exp) by optimizing for validation NLL loss over hundreds of runs, N ∈ {25, 50, 75, 100, 200, 400}. Besides our own dataset Wiki-Convert (see Chapter 3), we also train and test our methods on Numeracy600K (Chen et al., 2019), a corpus of financial market comments. For both datasets, we train on 100k samples, test on 10k, and use another 10k held-out dev set for configuring hyperparameters. For every input sentence, we randomly mask 15% of its non-number tokens and use a negative log likelihood loss to optimize the classifier. We measure perplexity and hit@k, masking one (non-number) word at a time. Implementation Details We use HuggingFace Transformers (Wolf et al., 2020) for pretrained models and PyTorch Lightning (Falcon et al., 2019) for finetuning. We only train the masked language modeling (MLM) classifier (initialized from scratch) and the number encoder’s parameters, if any, while keeping the base transformer weights frozen. The MLM classifier has a dense layer (768 × 768 weights) and a decoder (768 × 30522 weights) to learn output embeddings for each vocabulary item, where 768 is the embedding size for BERT-base-uncased. Value and Log Value methods with 200 hidden neurons thus consist of 768 × 200 (weight) + 200 (bias) + 200 × 1 (weight) extra parameters. Exponent embeddings with 200 bins consist of 768 × 200 extra parameters for the lookup embeddings. The Num model has a single lookup embedding, i.e., 768 extra parameters. None and Default methods do not contain extra parameters. 48 Wiki-Convert Numeracy600K PPL↓ Hit@1 Hit@5 Hit@20 Hit@100 PPL↓ Hit@1 Hit@5 Hit@20 Hit@100 Default 3.11 65.57 83.86 90.97 95.75 5.07 57.94 73.52 82.84 90.53 Num 3.32 64.04 82.83 90.37 95.28 5.29 57.36 73.24 82.18 90.24 None 3.36 63.73 82.58 90.20 95.28 5.18 57.75 73.78 82.76 90.41 Value 3.28 64.54 83.01 90.51 95.37 4.90 58.90 74.37 83.62 90.66 LValue 3.26 64.67 83.07 90.48 95.43 4.90 58.66 74.68 83.54 90.73 Exp 3.05 66.15 84.07 91.16 95.86 4.63 60.03 75.61 84.38 91.06 Table 5.2: Results on masked word prediction over two datasets and six methods, averaged over two runs with different random seeds. PPL = Perplexity. LValue = Log Value. Exp = Exponent embeddings. A dropout of 0.2 was applied to the hidden layer in Value and LogValue. We do not incorporate the next sentence prediction loss. While evaluating models that perplexity scores for masked LMs are lower than those of causal LMs since the former use bidirectional context while the latter only see preceding words. Our compute resources include sets of four GeForce RTX 2080 Ti GPUs, which take less than twenty minutes to train a model for 10 epochs of 10k training samples (batch size 256 with accumulated gradients over 4 batches). We set batch size as 1024, the largest that we could fit onto a single GPU, since we find that large batch sizes consistently help all methods and baselines. We train all models for 10 epochs over a training set of 100𝑘 sentences, i.e., ∼ 1000 updates, since we find this regime to allow nearly all runs to converge. 5.3 Results and Discussion Our experiments help us answer three key questions about the effect of numeracy on literacy for language models: 49 Does numeracy help improve word prediction when numbers are present? Table 5.2 shows the perplexities and prediction accuracies as hit@{1, 5, 20, 100} scores over the test splits of Wiki-Convert and Numeracy600K. 
We find that exponent embeddings are the top scorers on all dataset-metric combinations, achieving statistically significant improvements (at 99% confidence) against the default baseline. Numeracy600K is sourced from financial domain-specific articles and market comments, hence is the more challenging dataset. This is evident by the consistently higher perplexities and lower prediction scores. The Value and LogValue methods also manage to outperform the default baseline for Numeracy600K but they score below this baseline for Wiki-Convert. However, the latter dataset was sourced from Wikipedia, over which BERT was pretrained using the default scheme, hence this makes for an unfair comparison. Does numeracy lead to better literacy, even in contexts without numbers? We compare exponent embeddings (the best performer) against the default baseline on 1000 sampled sentences from the 2006 English dump of Wikicorpus (Reese et al., 2010) which do not have any annotated numbers. Table 5.3 shows that exponent embeddings continue to show much better results over the baseline. Default Exp PPL 7.30 ± 0.14 7.05 ± 0.06 H@1 51.12 ± 0.35 51.66 ± 0.26 H@5 67.89 ± 0.33 68.36 ± 0.11 H@20 78.09 ± 0.17 78.59 ± 0.15 H@100 86.98 ± 0.17 87.27 ± 0.11 Table 5.3: Results on masked word prediction in non-numeric contexts from Wikicorpus, averaged over three runs. PPL = Perplexity. Exp = Exponent embeddings. 50 Where exactly does numeracy help in improving literacy? We analyse examples where predictions from the default baseline erred while those from exponent embeddings were correct. Table 5.4 shows two representative kinds of such cases. The first three rows are examples of where we expect number encoders to help. The last row highlights a much more subtle semantic distinction (elevation vs altitude) between the two predictions. Our qualitative analysis suggests that most errors made by the default LMs are due to semantic subtleties. Case Sentence Exp* Default Intuitive The four petals are about 2 [mask] long and slightly hairy. mm meters Intuitive ... once [mask] along the hard shoulder of the M11 at 140 mph to avoid traffic? drove walked Intuitive With its solar panels fully extended it spanned 20 [mask] . meters kilometers Subtle The Grimsel Pass is a mountain pass in Switzerland ... at an [mask] of 2164 meters. elevation altitude Table 5.4: Qualitative error analysis over Wiki-Convert, showing examples where the Default baseline fails and the Exponent embeddings correctly predict the masked word. Asterisk* indicates: same as ground truth. Quantitatively, we further stratify our results by the kind of masked token: is it a unit (e.g., third row in Table 5.4) or not? Table 5.5 compares exponent embeddings against the default baseline, stratified over two categories of masked tokens: units and others. We find exponent embeddings to consistently outperform the default baseline over both categories. The majority of gains stem from non-unit tokens since they are more abundant than units. 5.4 Discussion The consistency of results over different corpora, configurations, and random seeds, suggest that specialized encoders do improve literacy. Such results warrant experiments on a larger scale, such as pretraining numerate language models from scratch. 
51 Exp Default Others Units Others Units N 7187 717 7167 716 PPL 5.22 1.76 5.38 1.95 H@1 52.80 71.06 51.60 72.28 H@5 74.55 98.44 73.85 96.88 H@20 84.98 99.87 84.74 99.09 H@100 92.78 100.00 92.45 99.87 Table 5.5: Results over a sample of Wiki-Convert test set, stratified by the kind of token masked. This chapter studies the effect of number encoders on the task of masked word prediction, as a proxy for the ability of understanding text. We show that specialized number encoders are helpful in improving the word prediction ability of a language model, evaluated by perplexity and hit@k scores. We demonstrate these gains not only over sentences with annotated numbers but also more generally on text without numbers. We find exponent embeddings to be the best number encoders for masked word prediction. We see our work as preliminary evidence that numeracy enhances the literacy of language models. 52 Chapter 6 Beyond Numeracy: Tokenization of Multi Word Expressions Subword tokenization algorithms like Byte Pair Encoding (BPE) (Sennrich et al., 2016b) group together frequently occurring patterns, such as -ing or -ly, into individual tokens. The success of subword tokenization points to the benefit in modeling longer patterns, even though any given text can be represented simply as a sequence of characters. This chapter stretches the motivation further by allowing BPE to cross word boundaries. In the context of NMT, we find that the straightforward way to find MWEs by BPE (sorted by frequency) hurts performance whereas sorting by PMI scores improves scores. We hypothesize and discuss a reason for these observations and provide further recommendations on using MWEs with BPE. N-gram tokens have been used in traditional NLP for a long time and with much success. For example (Table 6.1), the bigram New York can be a concise yet useful feature in a Named Entity Recognition task. Similarly, a Spanish-English Machine Translation (MT) model might benefit from having the bigram te amo or its trigram translation I love you in its vocabulary. Finally, a model’s vocabulary could even extend to non-contiguous tokens or k-skip-n-grams such as neither · nor. This token reappears in several contexts e.g. neither tea nor coffee and neither here nor there (underlined words replace the · skip). 53 Raw He lives in New York . Tok He_ lives_ in_ New_York_ ._ Raw I love the Statue of Liberty! Tok I_ love_ the_ Statue_of_Liberty_ !_ Raw She lost her bag . Tok She_ · her_ lost_ <SKIP> bag_ ._ Table 6.1: Example tokenizations of MWEs (bigrams, trigrams, skip-grams) in our implementation. Raw = original sentence, Tok = tokenized form. Typical BPE tokens are colored yellow and MWEs are colored green. This chapter experiments with two ways to expand BPE with MWEs for the task of NMT. Concretely, we promise the following contributions: Lang. Pair Hi → En De → En Split Dev Test Dev Test sacre chrF sacre chrF sacre chrF sacre chrF Metric BLEU 𝛽 = 2 BLEU 𝛽 = 2 BLEU 𝛽 = 2 BLEU 𝛽 = 2 Baseline 20.8 49.5 22.0 52.3 39.1 62.4 35.6 59.1 Unigram 19.5 49.0 21.2 51.5 36.5 60.3 32.4 56.8 BPE+ngms 19.5 49.0 21.2 51.6 38.7 62.2 35.3 58.9 BPE+n/sgms 18.4 48.1 20.7 51.3 38.4 62.1 35.2 58.9 PMI methods Bigrams 20.6 49.2 22.2 52.6 39.1 62.4 35.8 59.3 Trigrams 20.7 49.5 22.0 52.3 39.0 62.2 35.7 59.0 N-grams 21.2 50.0 22.1 52.6 38.9 62.3 35.8 59.1 Skip-grams 20.6 49.9 22.1 52.4 38.7 62.1 35.9 59.2 Table 6.2: Different methods of adding MWEs to a BPE vocabulary on NMT across two language pairs. 54 1. 
We find, counter-intuitively, that the straightforward frequency-based BPE, when applied beyond words, performs worse than the baseline on NMT across two language pairs.
2. We hypothesize that this negative result is caused by the constituents of such high-frequency MWEs (e.g., in_the) combining in many diverse ways, rendering such tokens incoherent.
3. We show that PMI-based BPE for MWEs reverses the drop and improves BLEU scores. We offer more recommendations on where and how to use MWEs with BPE.

6.1 Methods

MWEs have been commonly used in traditional NLP but rarely in the age of transformers and subword vocabularies. Here we describe two ways to add MWEs to a BPE vocabulary.

6.1.1 BPE beyond words

Our baseline is the vanilla BPE tokenization scheme, which starts from characters and iteratively adds the most frequent subwords to the vocabulary. An intuitive extension to BPE is BPE+ngms, i.e., allowing BPE to choose between not just adding subwords but also frequently occurring n-grams (e.g., of_the appears at the 163rd position in the vocabulary). This chapter limits n-grams to bigrams and trigrams. Besides continuous multi-word expressions, we also experiment with discontinuous MWEs, i.e., k-skip-n-grams, which we refer to concisely as skip-grams. In particular, we focus on 1-skip-3-grams, e.g., neither · nor, I · you. We replace a 1-skip-3-gram occurrence ($w_1 \cdot w_2$) with ($w_{12} \cdot$ <SKIP>), where $w_{12}$ is a new token representing the occurrence of this specific 1-skip-3-gram, and <SKIP> is another new token, shared by all skip-grams, to indicate that the skip-gram ends here. The last row of Table 6.1 shows an example tokenization with skip-grams. In BPE+n/sgms, we allow frequent skip-grams (e.g., ( · ); neither · nor) to also be part of the vocabulary.

6.1.2 Adding MWEs with PMI

As hinted previously, the intuitive extension to BPE does not work well in practice. Instead of raw frequency, here we find MWEs using a common technique for finding word collocations: Pointwise Mutual Information (PMI), which is a measure of the association between word types in text. We calculate the PMI of n-grams as:

$$\mathrm{PMI}(a_1, \ldots, a_n) = \log \frac{P(a_1, \ldots, a_n)}{\prod_{i=1}^{n} P(a_i)}$$

where $a_i$ are unigrams (words) from the corpus, $P(a_i)$ denotes their independent probabilities, and $P(a_1, \ldots, a_n)$ denotes the joint probability of the n-gram. In this chapter, we report experiments with only Bigrams ($n = 2$), Trigrams ($n = 3$), and their combination N-grams. We also experiment with Skip-grams, or 1-skip-3-grams ($w_1 \cdot w_2$), extracted from our corpus in the same way as bigrams ($w_1 w_2$) and ordered by PMI. We identify candidate word pairs separated by one word (which we depict by ·) and sort them based on PMI scores, some of which are deemed good enough to replace the least frequent subwords in the BPE vocabulary. We find that the skip-grams obtained by simply ordering by PMI are often better suited to be trigrams, e.g., the · in Statue · Liberty, a high-ranked candidate skip-gram, is almost always of. To disentangle such cases, we filter candidates based on whether the middle (skipped) word has a spread-out distribution, as opposed to being dominated by a single word: the skipped word in I · you could be replaced by several words like love, hate, or miss. In practice, we enforce (1) a lower limit (15) on the number of unique words which replace the · token, and (2) an upper limit (10%) on the probability of the most frequently occurring skipped token for the particular skip-gram.
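To make the construction above concrete, here is a rough sketch of scoring bigram and skip-gram candidates by PMI with the two filters just described. The corpus path, whitespace tokenization, and minimum-count cutoff are illustrative assumptions, not the exact pipeline used in the experiments (which relies on the NLCodec/RTG tooling mentioned later in this chapter).

# Sketch of ranking MWE candidates (bigrams and 1-skip-3-grams) by PMI.
# "train.en" is an assumed path; one whitespace-tokenized sentence per line.
import math
from collections import Counter

corpus = [line.split() for line in open("train.en", encoding="utf-8")]

unigrams, bigrams, skipgrams = Counter(), Counter(), Counter()
skipped = {}  # (w1, w2) -> Counter of words seen in the skipped slot
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))
    for w1, mid, w2 in zip(sent, sent[1:], sent[2:]):
        skipgrams[(w1, w2)] += 1
        skipped.setdefault((w1, w2), Counter())[mid] += 1

n_words = sum(unigrams.values())

def pmi(joint_count, parts, total):
    p_joint = joint_count / total
    p_indep = math.prod(unigrams[w] / n_words for w in parts)
    return math.log(p_joint / p_indep)

# Bigrams ranked by PMI (a minimum count keeps rare, noisy pairs out).
total_bi = sum(bigrams.values())
bigram_pmi = {bg: pmi(c, bg, total_bi) for bg, c in bigrams.items() if c >= 5}

# Skip-grams ranked by PMI, kept only if the skipped slot is spread out:
# at least 15 unique middle words, none covering more than 10% of occurrences.
total_skip = sum(skipgrams.values())
skip_pmi = {}
for sg, c in skipgrams.items():
    mids = skipped[sg]
    if len(mids) >= 15 and mids.most_common(1)[0][1] / c <= 0.10:
        skip_pmi[sg] = pmi(c, sg, total_skip)

print(sorted(bigram_pmi, key=bigram_pmi.get, reverse=True)[:5])
print(sorted(skip_pmi, key=skip_pmi.get, reverse=True)[:5])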
6.2 Datasets We use the IIT Bombay Hindi-English parallel corpus v3.0 (Kunchukuttan et al., 2018), tokenized using IndicNLPLibrary (Kunchukuttan, 2020) and Moses Tokenizer (Koehn et al., 2007a) respectively. The Train : Dev : Test splits have 1.6𝑀 : 0.5𝐾 : 2.3𝐾 sentences respectively. For German-English, the datasets are retreived from the News Translation task of WMT2019 (Barrault et al., 2019). The Train : Dev : Test splits have 4.5𝑀 : 3𝐾 : 2𝐾 sentences respectively. While we use the originally mentioned training set for our main results in Table 6.2, we found several noisy sentence pairs in the training dataset (the dev and test set were clean). Some such sentences had English characters (latin alphabet) in the source (Hindi) side and others had nonEnglish characters on the target (English) side. We filtered out 250K sentence pairs where either the source side had non-Hindi characters or the target side had non-English characters, wherein we count the following near-universal symbols as part of either language: ., () []! : −” ′ ; <>?&˘@ 6.3 Experiments While MWEs can augment the subword vocabulary of any NLP model, this chapter focuses on the task of NMT. Following Gowda and May (2020), we fix the transformer architecture (Vaswani et al., 2017) and train models with different vocabularies from scratch. 57 Train Dev Test Hi-En IITB-Training (1.3M) IITB-Dev (0.5K) IITB-Test (2.5K) Europarl v10 (1.8M) De-En WMT13CommonCrawl (2.4M) NewsTest18 (3K) NewsTest19 (2K) NewsCommentary v14 (0.3M) Table 6.3: Training, validation and testing datasets along with sentence count in each set. from Hi-En from De-En to De-En to Hi-En Bi 1.55% 1.30% Tri 0.30% 0.40% Skip 13.34% 13.45% Bigrams Trigrams Skip-Grams Freq per cent New York City the · of of the New York European Central Bank a · of do not Prime Minister Italian Prime Minister ( · ) they are Middle East behind closed doors was · to as well as United Nations former Prime Minister not · to one of the Table 6.4: Left: Coverage of the top 5 most frequent English MWEs (PMI-based), extracted from the first language pair and (coverage) evaluated over the second. Coverage of a token is defined as the fraction of target (English) sentences containing the token. Right: The top five MWEs of each type (PMI except when labelled Freq). Our baseline vocabulary is BPE with 8K subword tokens for Hi-En and 16K for De-En. Each of our methods maintains the same vocabulary size, replacing the least frequently occurring subwords with corresponding n-grams or skip-grams. We show representative MWEs learned from corpora in Table 6.4 alongside the coverage of (PMI) MWEs across language pairs. We also compare with a Unigram (Kudo, 2018b) SentencePiece vocabulary of 8K tokens each on source and target sides, with 𝑠𝑝𝑙𝑖𝑡_𝑏𝑦_𝑤ℎ𝑖𝑡𝑒𝑠𝑝𝑎𝑐𝑒 flag set to false (Kudo and Richardson, 2018). This allows the Unigram method to go beyond the word boundary and add n-grams to its vocabulary. Our NMT model is a 6 layer transformer encoder-decoder (Vaswani et al., 2017) that has 8 attention heads, 512 hidden vector units, and a feed forward intermediate size of 2048, with GELU 58 activation. We use label smoothing at 0.1, and a dropout rate of 0.1. We use the Adam optimizer with a controlled learning rate that warms up for 16K steps followed by a decay rate recommended for training transformer models. We trim longer sequences to a maximum of 512 tokens after BPE. 
Each model is trained from scratch, and the hyperparameters (per language pair) are chosen by grid search to optimize the baseline validation BLEU. We train all models for up to 100𝐾 steps (batch size = 24𝐾 tokens) and report sacreBLEU (Post, 2018a) and chrF (𝛽 = 2) scores (Popović, 2015). The number of tokens replaced in the original BPE vocabulary with a corresponding MWE ordered by PMI, is also a hyperparameter optimized by grid search between 1.25% to 10% of the vocabulary size (Hi-En models performing best when 1.25% tokens were replaced and De-En models performing best at 2.5% for Bigrams/Trigrams and 5% for Skipgrams). We make sure to not replace any rare base characters like 𝑄 or @. For ablations (Section 6.4.2) with limited compute budget, we train Hi-En models for up to 200K steps.We apply a patience of 10 validations, each 1000 update steps apart. To decode, we average the best 3 checkpoints, and use a beam size of 4 with length penalty of 0.6. We use NLCodec and RTG libraries (Gowda et al., 2021a) and contribute our extensions to them as well. 6.4 Results Table 6.2 shows our main results. We find that naively extending BPE beyond words harms the model, and Unigram likewise fails to consistently outperform the baseline. On the other hand, adding MWEs using PMI gives the best performance across language pairs and metrics. 59 Figure 6.1: Qualitative error analysis over Hi-En test set, showing examples comparing the Baseline and the Skip-Gram augmented model, where the skip-gram (This · is) occurs in the latter’s predictions. Moreover, since the methods of extracting MWEs is purely emprirical and is language agnostic, the results and observations can be extended for different language pairs. We now attempt to reason why BPE fails beyond word boundaries in its vanilla form, and why switching to PMI solves the problem. We also study where does it help the most to add MWEs. Unless noted otherwise, the analysis is reported on the Hi-En dataset. 6.4.1 Words combine in Diverse ways Empirically, we observe (Table 6.2) that BPE with high frequency MWE tokens sees a drop in performance whereas the PMI counterpart as well as the original baseline (within word boundary) performs well. What then happens at the word boundary that the BPE algorithm stops working? We hypothesize that this is the result of words combining in more diverse ways than subwords. BPE beyond word boundary adds frequently occurring n-grams to its vocabulary such as 𝑖𝑛_𝑡ℎ𝑒 which occurs in over a tenth of all test sentences. Despite adding it as a separate token to the vocabulary, the average BLEU on this subset of test sentences drops compared to the baseline (20.0 vs 21.8)! One factor for this result could be that the constituents of 𝑖𝑛_𝑡ℎ𝑒 combine in more 60 ways than one. The word 𝑖𝑛 appears as the ending of over 30 n-grams (𝑡ℎ𝑎𝑡_𝑖𝑛, 𝑤𝑎𝑠_𝑖𝑛, . . .) and the word 𝑡ℎ𝑒 appears as the beginning of 200 other n-grams (𝑡ℎ𝑒_𝑝𝑒𝑜𝑝𝑙𝑒, 𝑡ℎ𝑒_𝑓 𝑖𝑟𝑠𝑡, . . .) - all of which combine to a total of over another tenth of the test set, more than the frequency of 𝑖𝑛_𝑡ℎ𝑒 itself. Such versatile combinatorics is rarely observed at the subword level. Suffixes like 𝑖𝑛𝑔 almost never appear as prefixes whereas prefixes like 𝑑𝑒 almost never appear as suffixes. When such subwords combine to form longer tokens, they generally retain a coherent meaning, unlike ngrams like 𝑖𝑛_𝑡ℎ𝑒. Finally, this hypothesis may explain why MWEs ordered by PMI help improve MT scores – they are by definition units that co-occur as a coherent unit. 
Indeed, the MWEs thus found (e.g. 𝑁 𝑒𝑤_𝑌𝑜𝑟𝑘, 𝑝𝑒𝑟_𝑐𝑒𝑛𝑡) include constituents which exclusively form only these tokens. To summarize, we argue that BPE stops working at word boundaries because word pairs rarely, unlike subwords, combine into meaningful units that deserve a unique representation. We find convincing arguments from sentence-level BLEU scores and the number of different ways the constituents of different tokens occur, more of which are reported in supplementary materials. 6.4.2 Where do MWEs help NMT? Here, we conduct ablations for the PMI method (on a smaller batch size of 1K tokens, on the Hi-En dataset) to determine whether MWEs help more for machine translation on the source side (Hi), on the target side (En), or both? Table 6.2 reports on the ‘both’ setting but here we revisit this design choice. Table 6.5 reports BLEU scores with each such variant. Bold-faced cells indicate the best performing (on dev set) variant for every row. We observe that continuous MWEs (bigrams and trigrams) benefit more on the source-side whereas discontinuous MWEs (skip-grams) help the 61 Target (En) Source (Hi) Both Bi 14.4 / 14.8 15.9 / 16.0 15.8 / 15.3 Tri 14.7 / 15.4 15.5 / 15.5 15.4 / 15.2 Skip 15.3 / 15.2 15.1 / 15.1 15.5 / 15.0 Table 6.5: Do MWEs help more when added to the source-side, the target-side or both? Each cell reports Dev/Test BLEU scores over Hi-En dataset only. Baseline scores without MWEs are 15.6 / 14.4 respectively. most when applied to both source and target side. Note that, since De-En has been usually used in a triple shared vocabulary setting, we followed the same and thereby it must always follow the ‘both’ model. Finally, we show in Figure 6.1 some representative examples of sentences with MWEs (particularly, the skip-grams) from the PMI-BPE Hi-En model’s vocabulary. The first two rows show examples where the skip-gram indeed occurred in the reference, hence it helped the model. The last row shows how the model overuses the skip-gram, i.e. using skip-gram instead of separate tokens, and gets a translation wrong thus hurting the score as the reference sentence does not use the skip-gram. We note that BLEU itself relies only on the presence or absence of contiguous n-grams, and may unfairly penalize paraphrased outputs such as these. 6.5 Related Work Attempts at merging NMT with MWEs typically include pairing up the network with a phrase based SMT system (Wang et al., 2017; Park and Tsvetkov, 2019; Lample et al., 2018) and hierarchical phrases are expressive enough to cover discontinuous MWEs (Chiang, 2007). Zaninello and Birch (2020) add manually annotated MWEs aligned across the source and target language (En-It). 62 Figure 6.2: Top scoring multi-word expressions extracted from the training corpora. However, this might not work for low resource languages, hence we extract MWEs automatically with PMI. They count discontinuous MWEs, one of our main contributions, among future work. Multi-word tokens have a proven track record in NLP. Skip-gram tokens, for instance, have already been used in phrase-based machine translation (Lample et al., 2018; Park and Tsvetkov, 2019; Wang et al., 2017) to tackle cases where certain phrases in a source language (duonianlai in Chinese) are better represented as skip-grams in a target language (over the last · years in English) (Chiang, 2007). Our work revisits these ideas and adapts them to a transformer-based NLP model relying on subword segmentation. 
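The discontinuous MWEs used throughout this chapter (e.g., This · is) can be extracted with the same machinery. The sketch below (a hypothetical helper, not the released implementation) counts 1-skip bigrams of the form w1 · w3, which can then be ranked by PMI or frequency exactly like the contiguous n-grams.

    from collections import Counter

    def skip_bigram_counts(sentences):
        # Count pairs of words separated by exactly one token, written "w1 · w3".
        skips = Counter()
        for sent in sentences:
            words = sent.split()
            skips.update(zip(words, words[2:]))
        return skips

    # e.g. skip_bigram_counts(["this movie is great", "this film is long"])
    # -> Counter({("this", "is"): 2, ("movie", "great"): 1, ("film", "long"): 1})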
There also exists prior work on defining, counting, and evaluating k-skip-n-grams (Guthrie et al., 2006; Pickhardt et al., 2014; Ptaszynski et al., 2014), although unrelated to the task of NMT. Finally, readers interested in other applications of extracting MWEs via PMI scores may refer to Levine et al. (2021) where similar techniques are used to efficiently mask tokens while pretraining BERT (Devlin et al., 2019). 6.6 Discussion This chapter systematically studies the impact of extending a BPE vocabulary with multi-word expressions for neural machine translation. BPE tokenization merges characters into longer tokens by finding frequently occurring contiguous patterns within the word boundary. An 63 intuitive relaxation would be to extend a BPE vocabulary with multi-word expressions (MWEs): bigrams (𝑖𝑛_𝑎), trigrams (𝑜𝑢𝑡_𝑜 𝑓 _𝑡ℎ𝑒), and skip-grams (ℎ𝑒·ℎ𝑖𝑠). In the context of Neural Machine Translation (NMT), we replace the least frequent subword/whole-word tokens with the most frequent MWEs. We find that these modifications to BPE end up hurting the model, resulting in a net drop of BLEU and chrF scores across two language pairs. We observe that naively extending BPE beyond word boundaries results in incoherent tokens which are themselves better represented as individual words. Moreover, we find that Pointwise Mutual Information (PMI) instead of frequency finds better MWEs (e.g., 𝑁 𝑒𝑤_𝑌𝑜𝑟𝑘, 𝑆𝑡𝑎𝑡𝑢𝑒_𝑜 𝑓 _𝐿𝑖𝑏𝑒𝑟𝑡𝑦, 𝑛𝑒𝑖𝑡ℎ𝑒𝑟·𝑛𝑜𝑟) which consistently improves translation performance. Our results point to the vast unexplored scope of different granularities of tokenization that can be exploited by NLP systems. Notably, our methods extend to not only longer contiguous tokens like n-grams but also skip-grams, which have been relatively unexplored with transformer-based NLP. We release all code at https://github.com/pegasus-lynx/mwe-bpe. 64 Chapter 7 Towards End-to-End Learnt Tokenization Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as ‘ing’ or whole words. Recent literature has repeatedly shown the limitations of such a tokenization strategy, particularly for documents not written in English and for representing numbers. On the other extreme, byte/character-level language models are much less restricted but suffer from increased sequence description lengths and a subsequent quadratic expansion in self-attention computation. Recent attempts to compress and limit these context lengths with fixed size convolutions is helpful but completely ignores the word boundary. This chapter considers an alternative ‘learn your tokens’ scheme which utilizes the word boundary to pool bytes/characters into word representations, which are fed to the primary language model, before again decoding individual characters/bytes per word in parallel. We find that our moderately expressive and moderately fast end-to-end tokenizers outperform by over 300% both subwords and byte/character models over the intrinsic language modeling metric of next-word prediction across datasets. It particularly outshines on rare words, outperforming by a factor of 30! We extensively study the language modeling setup for all three categories of 65 Efficiency Expressivity Accuracy Subword High Low Mid Byte/Char Low High Low eByte/eChar Mid Mid High Table 7.1: Trade-offs involved when choosing tokenizers: Subword vs Bytes/Characters vs eByte/eChar (ours). 
tokenizers and theoretically analyze how our end-to-end models can also be a strong trade-off in efficiency and robustness. Recent work has shown countless limitations with subword embeddings. Several languages contain diverse morphological features whereas subword segmentation is mostly apt at only identifying suffixes and prefixes (Clark et al., 2022). Technical domains such as biomedical documents often need to pre-train their own tokenizer for improved vocabulary (Boecking et al., 2022a). Finally, numbers are often inconsistently segmented into subwords, leading to decreased arithmetic (Wallace et al., 2019) and estimation (Thawani et al., 2021c) skills. The extent of these numeric limitations is so dire that GPT-4 (OpenAI, 2023) has an explicit workaround of adding all numbers from 0 to 999 as individual tokens to the model’s vocabulary. Recently, several language models have been proposed which remove the tokenizer vocabulary entirely, beginning with a character (El Boukkouri et al., 2020) or byte-level (Xue et al., 2022) vocabulary and often compressing them into fixed units of around four tokens each (Tay et al., 2021; Yu et al., 2023; Clark et al., 2022). While these zero-assumption methods are useful in compressing text and consequently expand context windows, they completely ignore the word boundary. Besides, the so-called ‘tokenizer-free’ byte-based models are not entirely bias-free since the Unicode-8 encoding they use is itself biased towards representing Latin scripts with a single 66 byte each, whereas some scripts1 like Bammum (Africa), Meetei (India), and Cherokee (North America) may require four bytes to represent a single character. The concept of words is a fundamental feature of nearly all human languages, including those written in Chinese or Japanese scripts that do not explicitly delineate words by whitespaces. This chapter empirically studies the case where tokenizers lose their subword segmentation algorithms but utilize the word boundary for a multi-level model with added efficiency. More concretely, we use the word boundary to compress the base tokens of bytes or characters into word representations, which are then fed into the underlying language model (here, a small version of GPT (Radford et al., 2018)). Our end-to-end learned tokenization undoubtedly has several limitations. It is not faster than subwords. It does not allow characters/bytes within one word to directly attend to those in another word. It relies on the word boundary, which is not straightforward to find for most internet-scale datasets. Nevertheless, we believe this empirical deep-dive into tokenizers for language modeling offers the following contributions: 1. We compare different tokenizer strategies for language modeling on multiple facets and on a fair footing across languages. 2. We are the first to explicitly use word boundary to compress an autoregressive language model’s base tokens. 3. We report over 300% gains in language modeling capabilities over multiple languages and datasets, against both subwords and character/byte models, and by a factor of 30 on rare words. 1https://unicode.org/roadmaps/bmp/ 67 4. We theoretically analyze strengths and weaknesses of our word-compressed tokenization scheme, which carries insights for the language modeling community. 7.1 Method Figure 7.1 pictorially depicts our proposed language model architecture. 
Our end-to-end tokenization strategy is a straightforward word-pooling method which uses a transformer encoder (Step 1) to pool the base tokens (characters or bytes) into a fixed number of embeddings per word. This is analogous to how CLS embeddings are often used to pool the embeddings of an entire sentence or any text sequence in BERT-like transformer encoders. In our case, we have the equivalent of a fixed number of CLS tokens2 prepended to each word that store the meaning of the entire word. Next, (Step 2) the pooled per-word embeddings are passed on to the main language model, in our case a vanilla transformer decoder like GPT (Radford et al., 2018). Finally, at the decoding stage (Step 3), the contextualized word representations are unrolled with another transformer decoder to autoregressively predict the next word, one base token (character/byte) at a time.3 Note that we call this method an end-to-end ‘tokenizer’ since it compresses the many units into a few embeddings per word, just like subwords, except the compression is learned from scratch.

2This chapter uses 4 CLS tokens per word, except in Section 7.3.2 where we ablate with 1 CLS per word.
3This autoregressive decoder can also be replaced by a non-autoregressive transformer which emits the entire word in O(1) time. Our initial experiments with such a vanilla setup performed much worse than autoregressive models (in line with prior work), therefore we leave this to future work.

Figure 7.1: Overview of our proposed simple end-to-end tokenized autoregressive language model. A transformer encoder compresses the variable number of base units (here, characters) into n=1 CLS tokens per word. Dotted characters are the previously predicted tokens at inference, and when training they are the ground truth.

Figure 7.2: Self-attention visualized across (1) Byte/Char-level models, (2) Subword/Word-level models, and (3) our proposed end-to-end tokenization modules (word encoder; base LM decoder; word decoder) with a character base. Blue blocks indicate the self-attention mask. The @ symbol indicates a prepended CLS token per word.

Note how we achieve our purported trade-off between subwords and byte/character models. The CLS representations learnt are unconstrained by a deterministic mapping as in subwords. They are also efficient to compute and decode from, since the first and last steps only allow intra-word attention. For a tokenizer-free model, roughly 80% of the memory bottleneck4 is spent on tokens from one word attending to tokens in another word, which we contest is of questionable importance relative to the overhead incurred.

Formally, we begin with a sequence of words w_0, w_1, ..., w_n, each of which is comprised of an ordered set of base units (characters/bytes) w_i = c_i^0, c_i^1, ..., c_i^{m_i}, where m_i + 1 is the length of the i-th word. The task is autoregressive language modeling, i.e., given the previously seen words w_0, w_1, ..., w_{i-1} as well as the previously seen units of w_i (the current word), c_i^0, c_i^1, ..., c_i^{j-1}, predict the next unit c_i^j. Character/byte-level models ignore the word boundary and directly model the task as:

c_i^j = Decoder(c_0^0, ..., c_0^{m_0}, c_1^0, ..., c_i^0, ..., c_i^{j-1})

Subword segmentation maps the base units deterministically into fewer subwords per word, i.e., w_i = c_i^0 ... c_i^{m_i} → s_i^0 ... s_i^{m'_i}, where m'_i ≤ m_i is the number of subwords that the i-th word is decomposed into. Following this deterministic process, a subword model predicts the next subword as:

s_i^j = Decoder(s_0^0, ..., s_0^{m'_0}, s_1^0, ..., s_i^0, ..., s_i^{j-1})

4In Figure 7.2, this is the difference in blue attention blocks between the Byte/Char-level models and our intra-word attention.

Our end-to-end models instead follow a three-step process to (1) pool base units into a fixed set of embeddings per word, (2) autoregressively predict the next word embedding, and (3) autoregressively predict the individual unit embeddings per word:

CLS_i = Encoder(c_i^0, c_i^1, ..., c_i^{m_i})          (7.1)
CLS'_i = Decoder(CLS_0, CLS_1, ..., CLS_{i-1})         (7.2)
c_i^j = Decoder(CLS'_i ⊕ c_i^0, ..., c_i^{j-1})        (7.3)

Here, Encoder refers to a BERT-like transformer encoder, Decoder refers to a GPT-like transformer decoder, and ⊕ denotes prepending the predicted word representation to the previously seen units of the word. From an implementation standpoint, we prefix a fixed number (n = 1 or 4 in this chapter) of CLS tokens to every word before passing it through a transformer encoder. The word-level contextualized representations obtained on the other end are collectively depicted here as w_i. Figure 7.2 visualizes how our end-to-end model saves on the self-attention computation bottleneck by only allowing intra-word attention at the first step, before allowing contextualization of information across the word boundary in Step 2 using the base decoder model. Finally, Step 3 again restricts the individual characters/bytes to be predicted using only the single predicted word-level embedding.5

5Note that our current implementation has minor deviations from the shown simplistic figure. Refer to Section 7.2.4 for details.

7.2 Experiments

There are numerous NLP tasks that can benefit from improved tokenization, such as Machine Translation, Question Answering, and Text Classification. However, the scope of our preliminary analysis is not to cast a wide net over every downstream application. Instead, we choose to analyze in depth the most commonly used pre-training task in NLP, i.e., language modeling. We pretrain autoregressive language models from scratch using the different tokenizers described in the previous section, on the datasets described in Section 7.2.2.

7.2.1 Models

We report results over the following tokenizers:

1. Subword: a pretrained BPE vocabulary used by GPT-2 and GPT-3.
2. Byte: a pretrained byte-level vocabulary as implemented in ByT5 (Xue et al., 2022).
3. Character: a corpus-specific vocabulary learnt from each dataset, with a fallback to UNK for characters unseen in training.
4. eByte/eChar: our end-to-end tokenized models, which begin with the above Byte/Character vocabularies but are compressed into CLS representations as described in Section 7.1.

There can be countless ways to make for a ‘fair’ comparison across tokenizers. We train all models on all datasets for the same number of total epochs. We also focus on letting the models access the same context window size, i.e., the amount of information available to predict the next set of tokens. Different tokenizers can use vastly different memory sizes to fit the same amount of information. This is analogous to how the same book can be published in different font sizes to

Dataset    Size (MBs)   Words (Mil.)   Chars/Word
English    4.7          1.34           5.46
French     5.1          1.55           5.18
Russian    7.5          1.18           6.39
Numeracy   6.6          1.35           5.09

Table 7.2: Statistics for our language modeling datasets. See Section 7.2.2 for more details.
Tokenizer Acc Mem Params Acc Mem Params Acc Mem Params (%) (GBs) (Mil.) (%) (GBs) (Mil.) (%) (GBs) (Mil.) Language English French Russian Subword 14.37 0.55 76.8 41.20 1.50 76.8 8.31 1.49 76.8 Byte 13.69 0.53 25.7 17.39 0.54 25.7 12.76 0.53 25.7 Char 13.68 0.54 26.3 16.95 0.53 25.7 10.01 0.54 26.1 eByte 44.17 3.84 38.7 46.44 6.01 38.7 35.00 4.92 38.7 eChar 42.94 2.94 39.2 47.06 3.59 38.7 37.15 3.95 39.0 Table 7.3: Word Prediction Accuracies (Acc %) for different languages and tokenizers. See Section 7.3.1 for details. choose between light and bulky books. We control for this information parity by fixing the number of characters in the available context to 192 for each tokenizer and each dataset. Subword models will then be allowed to access 192//N subwords where N is the average number of characters per subword. 7.2.2 Datasets Our proposed method requires access to a word boundary signal, which can either be obtained from a clean natural language corpus, or by running a preprocessing pipeline on an unclean 74 corpus to filter out nonlinguistic tokens such as URLs or metadata. We chose the former to avoid confounding our results with a layer of preprocessing decisions. Therefore, our datasets are smaller but cleaner than the large-scale mC4 and OSCAR datasets typically used for training large language models. Our choice of languages depended on the availability of a large enough corpus of clean data. We also deliberately avoid Chinese and Japanese corpora since segmenting them into words would require an additional, possibly confounding step of segmentation through an off-the-shelf model. Concretely, here are the four datasets we pre-train and evaluate our language models on: 1. English: We randomly sample 10,000 paragraphs from the comprehensions of SQuAD2.0 (Rajpurkar et al., 2016) dataset. 2. French: We randomly sample 10,000 paragraphs from the comprehensions of SQuAD_FR (Cattan et al., 2021) dataset. 3. Russian: We randomly sample 10,000 paragraphs from the reading passages of the SberQuAD (Efimov et al., 2020) dataset. 4. Numeracy: We sample 60,000 rows of number-annotated sentences from Wiki-Convert (Thawani et al., 2021a), itself derived from the English Wikipedia. The task is to estimate these numbers approximately using the preceding words as context. Table 8.2 presents statistics for the datasets that we use. The average dataset consists of 7.4M characters (676 unique) and 1.4M words (102k unique). 75 7.2.3 Metrics Since the models have different vocabularies, we can not compare their perplexity scores. Instead, we fix the number of context to be exactly 192 characters and report the accuracy of predicting the next word (over held-out validation data from the same corpus as the training set). When estimating numbers, we report magnitude-based metrics that are typically reported in the literature (Thawani et al., 2021a; Berg-Kirkpatrick and Spokoyny, 2020): the order-of-magnitude Exponent Accuracy (EAcc: whether the number of digits are the same in the ground truth and predicted number) and Median Absolute Percentage Error (MdAPE: median of 100|𝑥 − 𝑦|/𝑦 where x is the prediction and y is the ground truth number). 7.2.4 Implementation Every model (using a different tokenizer) is pre-trained from scratch on every dataset described above. We report the aforementioned metrics on the individual test set from each corpus. Our base language model is a decoder-only transformer called minGPT6 with 8 layers. 
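Concretely, the overall forward pass of Steps 1-3 (Section 7.1) can be sketched as follows. This is an illustrative PyTorch sketch rather than the exact released code: positional embeddings and padding masks are omitted, the multiple CLS vectors per word are simply averaged into one, and the 2-8-2 layer split mirrors the configuration described next.

    import torch
    import torch.nn as nn

    def causal_mask(size):
        # Upper-triangular -inf mask: position i cannot attend to positions > i.
        return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

    class WordPooledLM(nn.Module):
        """Step 1 pools the characters of each word into CLS vectors,
        Step 2 runs a causal word-level LM over the pooled vectors,
        Step 3 decodes the characters of the next word from its predicted vector."""

        def __init__(self, vocab_size, d_model=512, n_heads=8, n_cls=4):
            super().__init__()
            self.n_cls = n_cls
            self.char_emb = nn.Embedding(vocab_size, d_model)
            self.cls_emb = nn.Parameter(torch.randn(n_cls, d_model))
            make = lambda layers: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), layers)
            self.word_encoder = make(2)   # Step 1: intra-word attention only
            self.word_lm = make(8)        # Step 2: causal attention across words
            self.word_decoder = make(2)   # Step 3: causal attention within a word
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, chars):
            # chars: (batch, n_words, chars_per_word), padded to a fixed word length
            b, w, c = chars.shape
            x = self.char_emb(chars).view(b * w, c, -1)
            cls = self.cls_emb.unsqueeze(0).expand(b * w, -1, -1)
            # Step 1: prepend CLS tokens per word and keep only their outputs
            pooled = self.word_encoder(torch.cat([cls, x], dim=1))[:, : self.n_cls]
            pooled = pooled.reshape(b, w, self.n_cls, -1).mean(dim=2)
            # Step 2: contextualize word vectors causally; output i predicts word i+1
            ctx = self.word_lm(pooled, mask=causal_mask(w).to(chars.device))
            # Step 3: decode characters of word i+1 from ctx[i] (teacher forcing)
            tgt = self.char_emb(chars[:, 1:]).reshape(b * (w - 1), c, -1)
            cond = ctx[:, :-1].reshape(b * (w - 1), 1, -1)
            seq = torch.cat([cond, tgt], dim=1)
            out = self.word_decoder(seq, mask=causal_mask(c + 1).to(chars.device))
            logits = self.lm_head(out)[:, :-1]
            return logits.reshape(b, w - 1, c, -1)  # per-character predictions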
For our end-to-end models, the main language model (Step 2) remains the same - with 8 layers like the others, whereas the word-encoder (Step 1) and word-decoder (Step 3) are both shallow transformers (encoder and decoder respectively) with 2 layers each. They use padding tokens to make each word of equal length for ease of training. We use trained absolute positional embeddings for all models, and the end-to-end models use it thrice - one for each step. We pretrain all models on all datasets from scratch for 100 epochs. 6https://github.com/karpathy/minGPT 76 We set the learning rate to 0.0001, batch size to 2, and block size to 192. We used AdamW as our optimizer and trained our models on NVIDIA A100-PCIe-40GB GPUs. With this configuration, training each model variant for 100 epochs took an average of 52 hours. 7.3 Results 7.3.1 Main Results Our main results are summarized in Table 7.3. Next word prediction accuracies over different datasets show that given a fixed context window, our end-to-end tokenized language models perform much better (up to 300% from 14% to 44% on English) on all datasets than both the default BPE subwords as well as the tokenizer-free character and byte models. This does come at a doubling of GPU memory requirements, due to the additional word-level modules in our architecture. 7.3.2 Representation Power Here we ablate the representative power available for word-pooling of character- or byte-level embeddings. This hyperparameter is controlled simply by adding a different (yet fixed) number of prefix CLS tokens per word before encoding via a transformer. Table 7.4 shows the word prediction accuracies and relative jumps when the number of prefix CLS tokens per word is increased from 1 to 4. We notice a huge jump for every model, with the trade-off in sequence description length. Note, however, that the memory usage does not jump by more than 20 MBs. Similarly, the number 77 Lang Tok 1 CLS 4 CLS Δ% Δ Mem en eByte 31 44 42% 0.02 en eChar 29 43 48% 0.02 fr eByte 31 46 48% 0.02 fr eChar 34 47 38% 0.02 ru eByte 26 35 35% 0.02 ru eChar 29 37 28% 0.02 Table 7.4: Word Prediction Accuracies for different representative power (number of prefix CLS tokens) per word in our end-to-end byte/char-tokenized (Tok) models. Up to 45% higher prediction scores are available for a marginal increase in memory (Mem in GBs) of about 20 MBs. See Section 7.3.2 for details. of parameters also increases (not shown in table) by only 300K (0.7%) for both eByte and eChar models. 7.3.3 Predicting Rare Words One of the primary motivations for subword tokenization is their ability to compositionally create rarer words using other frequently occurring subwords. Wolleb et al. (2023) recently show that such compositionality is a significant contribution to the empirical performance gains achieved by subword models. Hence, we report in Table 7.5 the word prediction accuracies for rare words (those seen less than 10 times in the training dataset) as well as frequent ones (those seen more than 45 times). We find our end-to-end models outperform by a factor of 5-7 on frequent words and over 30 times on rare words! 78 Tokenizer Rare Frequent Subword 0.11 7.20 Byte 0.00 4.36 Char 0.28 9.84 eByte 5.90 42.90 eChar 6.78 44.17 Table 7.5: Case study: Word Prediction Accuracies for Russian across tokenizers, stratified by Rare and Frequent words. See Section 7.3.3 for details. 
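The stratification above can be reproduced with a short helper; the thresholds of 10 and 45 occurrences come from the text, while the function itself is illustrative rather than the actual evaluation script.

    from collections import Counter

    def stratified_accuracy(train_words, predictions, targets,
                            rare_below=10, freq_above=45):
        # Split next-word accuracy by how often the gold word was seen in training.
        counts = Counter(train_words)
        buckets = {"rare": [0, 0], "frequent": [0, 0]}  # [correct, total]
        for pred, gold in zip(predictions, targets):
            seen = counts[gold]
            if seen < rare_below:
                key = "rare"
            elif seen > freq_above:
                key = "frequent"
            else:
                continue  # mid-frequency words are not reported in Table 7.5
            buckets[key][1] += 1
            buckets[key][0] += int(pred == gold)
        return {k: 100.0 * correct / total if total else 0.0
                for k, (correct, total) in buckets.items()}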
Tokenizer   % Num ↑   EAcc ↑   MdAPE ↓
Subword     20.0      44.8     95.72
Byte        39.9      40.5     99.00
Char        42.8      46.6     92.5
eByte       47.5      49.9     88.37
eChar       46.7      45.6     90.0

Table 7.6: Number Estimation results on the Numeracy dataset across tokenizers. % Num = the percentage of times the model predicts a number, over which the next two metrics are calculated. EAcc = Exponent Accuracy. MdAPE = Median Absolute Percentage Error. See Section 7.3.4 for details.

7.3.4 Number Estimation

We further evaluate a representative subset of tokenizers on the Wiki-Convert number estimation task. Table 7.6 again reports that the ability of the end-to-end tokenized eByte/eChar models is far better than both the subword and the Byte/Char models.

            En      Fr      Ru      Ru (rare)   Ru (freq)   Numeracy
Metric      Next Word Prediction Accuracy                   EAcc↑   MdAPE↓
Subword     14.37   41.20   8.31    0.11        7.20        44.8    95.7
Byte        13.69   17.39   12.76   0.00        4.36        40.5    99.0
Char        13.68   16.95   10.01   0.28        9.84        46.6    92.5
eByte       44.17   46.44   35.00   5.90        42.90       49.9    88.4
eChar       42.94   47.06   37.15   6.78        44.17       45.6    90.0

7.4 Efficiency Analysis

Here, we determine the theoretical training and inference/generation speed-up accessible by compressing words using our end-to-end tokenizer as opposed to tokenizer-free methods, while also comparing against the more efficient subword models.

7.4.1 Training Speed-up

Assume a total memory budget of M (say, in GBs) and a context window of T characters per batch. Also assume the tokenizer-free model (henceforth referred to as the base/baseline) is a decoder transformer with L layers and D dimensions. The memory footprint most significantly depends on the number of activations stored,7 which can be estimated as M = L·D·B·T², where B is the batch size. Given a fixed memory budget of M and a required context size of T characters, we can find our optimal batch size as:

B = M / (L·D·T²)

7The other components such as parameter variables and backpropagated gradients can be approximated away.

Assuming the training corpus comprises N characters, the number of training iterations required is:

X = N / (B·T) = N·D·L·T / M

Next, for subwords, a similar batch size can be estimated as:

B′ = M / (L·D·T²/s²)

where s is the number of characters per subword (roughly 2.8 for our three languages). Substituting to find the number of training steps:

X′ = N / (B′·T) = N·D·L·T / (M·s²)

The training speed-up of a subword model is therefore estimated to be X/X′ = s² = 7.8x.

Finally, we calculate the analogous number of training steps required for one epoch of our end-to-end character-tokenized model. We assume L/4 word-encoder layers, L primary LM (word-level) layers, and L/4 word-decoder layers for simplicity (this is our default setup in this chapter). Let B′′ be the optimal batch size that we wish to calculate and c be the average number of characters per word (roughly 5.5 for English). Note that we retain T characters as our context window, therefore the average number of words per batch sequence will be T/c. The memory footprint of activations would then be (L·D·B′′·T·c)/4 for the word encoder (and the same for the word decoder) and (L·D·B′′·T²)/c² for the primary (word-level) language model. This leads to the optimal batch size:

B′′ = M / (L·D·T·(c/2 + T/c²))

and the number of training steps:

X′′ = N / (B′′·T) = (N·D·L / M)·(c/2 + T/c²)

Finally, we estimate our proposed speed-up in total training time as:

X/X′′ = T / (c/2 + T/c²)

Plugging in c = 5.5 as a conservative number of characters per word8 and T = 192 as the context window length, we get a 6.8x speed-up in training steps, which is only marginally less than the subword speed-up (7.8x) relative to a character-level language model.

8Real values: En 5.5, Fr 5.2, Ru 6.4.

7.4.2 Generation Speed-up

Another unique advantage of our end-to-end tokenized model is in generation, which is also parallelized per word. A character/byte model must generate one token at a time, then feed the predicted token back into the input and run the forward pass again to autoregressively generate the next token. Assuming the L layers of a GPT-like decoder take t seconds for one forward pass, the generation speed for such a character-based model will be 1/t characters per second.

Subword models benefit from having longer tokens (roughly 2.8 characters per subword for the three languages we consider), and can therefore generate at a speed of 2.8/t characters per second.

With a very coarse assumption, our end-to-end character model with L/4 word-encoder layers and L decoder layers (ignoring the L/4 word-decoder layers for now) will require 5t/4 seconds to generate the representation of one word at a time. The next step can then be parallelized (with a trade-off in memory consumption) to both autoregressively generate the next word representation in another 5t/4 seconds, as well as autoregressively generate one character at a time using this predicted word representation. This word-level decoder that emits characters has L/4 layers, so a crude assumption would mean t/4 seconds per character. Therefore, at steady state, the word decoder will take 5.5t/4 seconds to generate the average 5.5 characters of a word, while the next word will be ready for decoding simultaneously in just 5t/4 seconds. Thus, the generation speed is 4/t characters per second, i.e., roughly 50% faster than subwords and four times as fast as tokenizer-free models.

7.5 Discussion

State-of-the-art tokenization approaches include subword segmentation schemes such as WordPiece (Wu et al., 2016b), Byte Pair Encoding or BPE (Sennrich et al., 2016a), and Unigram (Kudo, 2018a), all of which are statistical methods for preprocessing a large unlabeled corpus of text to yield a fixed vocabulary, midway between characters or bytes at one end and whole words at the other. This results in a convenient trade-off in sequence description length while avoiding the UNK token, that is, a fallback mechanism for handling rare words. However, it is not obvious why these hand-engineered algorithms would be the optimal forms of tokenization, or whether end-to-end models could also subsume this crucial stage of the NLP pipeline.

Subword tokenization is efficient but too rigid and deterministic. Character/byte-level models, on the other hand, are too expressive, which leads to inefficient training and inference. We propose a word-boundary-informed tokenizer that efficiently and robustly performs language modeling in a hierarchical, end-to-end, learned model. We show that it outperforms both extremes, subwords and character/byte models, by over 300%. We also analyze its trade-offs in training and inference efficiency.
Despite its many flaws including reliance on a word boundary signal and moderate efficiency as well as moderate expressiveness, we expect this preliminary study to pose an interesting trade-off tokenization for truly end-to-end language modeling. Our code is released on Github. 84 Chapter 8 Downstream effects: Learnt Tokenization on Machine Translation Byte Pair Encoding (Sennrich et al., 2016c), the default method used in most language models, starts with a vocabulary of only the 256 possible bytes and repeatedly merges the tokens that occur most frequently next to each other (e.g., 𝑡 +ℎ → 𝑡ℎ;𝑡ℎ+𝑒 → 𝑡ℎ𝑒; . . . ). The vocabulary of GPT-4, for instance, is obtained after 100,000 such merges, leading to some arguably unnecessary tokens like .translatesAutoresizingMaskIntoConstraints, //——————————————————————————\n\n, and abcdefghijklmnopqrstuvwxyz 1 . 1Source of GPT-4 vocabulary: https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a09193 Figure 8.1: Left: Non-concatenative morphology in Arabic often interleaves letters within the root (Clark et al., 2022). Right: Subword tokenization in GPT-4 instead only captures ‘contiguous’ sequences of characters. 85 Recent work has shown countless limitations with BPE subwords. Technical domains such as biomedical documents (Boecking et al., 2022b), source code (Dagan et al., 2024), and financial articles (Thawani et al., 2023b) benefit from pre-training their own tokenizer for improved language understanding. Another key dimension where subwords lack is language inclusivity (Team et al., 2022). Chinese characters, for instance, can be often represented better at the stroke level (Si et al., 2023). On the other hand, non-concatenative languages like Arabic can benefit from capturing long-range dependencies and not only contiguous patterns in characters - as seen in Figure 8.1. The research community has proposed several alternative tokenizers to improve NLP models (Thawani et al., 2023a; Clark et al., 2022; Kumar and Thawani, 2022; Fleshman and Durme, 2023). However, each of these tokenizers also modifies the model architecture, number of parameters, vocabulary size, and/or the training corpus, thereby confounding the benefits of only the tokenizer vocabulary (see Table 8.1). This chapter studies the effects of switching to a more expressive tokenizer while controlling for all the above confounders, in the context of neural machine translation. Our preferred alternative to subwords is a codebook learnt using vector quantization when autoencoding words in different languages (Samuel and Øvrelid, 2023) . It is a lossless arrangement of the vocabulary space that does not merely segment character sequences on the surface level, instead learns longer range dependencies among the constituent characters. We borrow the intermediate Factorizer tokenization depicted in Figure 8.2 and described in Section 8.2. We acknowledge that codebook-learned tokenizers have several shortcomings. They are not as directly interpetible as subwords. They require training from scratch since most pretrained language models today use subword vocabularies instead. They lack the inductive bias that 86 Tokenizer Citation Architecture Vocab Size Parameters Train Data FastText Bojanowski et al. (2017) No No No No ELMo Peters et al. (2018b) No No No No CharBERT El Boukkouri et al. (2020) Yes No No Yes CharFormer Tay et al. (2021) No No Yes Yes LOBEF Sreedhar et al. (2022) No No No Yes CANINE Clark et al. (2022) No No No Yes ByT5 Xue et al. (2022) No No Yes Yes MegaByte Yu et al. 
(2023) No No No Yes RetVec Bursztein et al. (2023) No No No Yes eByte/eChar Thawani et al. (2023a) No No Yes Yes Factorizer Samuel and Øvrelid (2023) Yes Yes Yes Yes Table 8.1: Literature Review of alternative tokenizers and what they control for. We work with Factorizer, the only tokenizer that controls for all dimensions and makes it possible to compare directly against a subword vocabulary. characters appearing close may form coherent units, which limits expressivity but is nonetheless a useful bias (Cao, 2023). Nevertheless, we believe our empirical and controlled analysis of their performance in machine translation offers several contributions: 1. We are the first to compare BPE tokenizers to a learnt vocabulary with the same size and the same architecture on the downstream task of Neural Machine Translation. 2. We show that while BPE outperforms Factorizer in general, the latter is more robust to noise and for very short and very long sentences (outperforms by as much as 70%). 3. We analyze why Factorizer prefers non-concatenative morphologies like Arabic. 87 8.1 Background Here, we describe the key tokenization strategies that we compare without modifying the underlying model architecture in any way. We refer the interested reader to Chapter 2 or Mielke et al. (2021b) for a deeper survey on tokenization in NLP. 8.1.1 Bytes Most natural language text on the internet is encoded using UTF-8 byte encodings, therefore a byte-level representation of text makes for a convenient option. Their vocabulary size is restricted to a mere 256 possible bytes, and most Latin languages require a single byte per character. Such approaches (Xue et al., 2022; El Boukkouri et al., 2020), however, suffer from being slow to infer due to large description lengths, particularly on non-Latin scripts (Edman et al., 2023). 8.1.2 Byte Pair Encoding The modern workhorse of tokenization in NLP is a heuristic atop byte representations called Byte Pair Encoding. Starting from a base of 256 bytes and a training corpus, the most frequently occurring byte pairs are incrementally merged, e.g., t+h →th, th+e→the, and so on. Nearly all large language models today (Touvron et al., 2023a,b; Groeneveld et al., 2024; Jiang et al., 2023) rely on Byte Pair Encoding as their base tokenizer, with different number of merges. GPT3 (Brown et al., 2020) uses a vocabulary of 50,257 BPE tokens (50,000 merges and a special token) while GPT4 (OpenAI, 2023) pushes it further to 100,000 merges. 88 One of the main goals of this chapter is to control for dimensions like vocabulary size, hence we train our own BPE on the training set of each dataset (independently for source and target sides) with a final size of 794 BPE tokens - the same as the factorizer (see next section). 8.2 Methodology We reuse the Factorized Subword Encoding Samuel and Øvrelid (2023), which trains an autoencoder to learn to decompose subwords into triplet codes, each ranging from 0 − 255, resembling an RGB color code2 . Such a factorization helps construct tokens with compositional units, e.g., 𝑚𝑒𝑙𝑜𝑛 is represented as [30, 255, 209], 𝑚𝑒𝑙𝑜𝑛𝑠 as [261, 255, 209] and 𝑤𝑎𝑡𝑒𝑟𝑚𝑒𝑙𝑜𝑛𝑠 as [208, 235, 109], [45, 255, 209], sharing most of their encoding. We refer the interested reader to the original paper for more implementation and training details, which we summarize in Figure 8.2. They focus on pooling these RGB embeddings to give a single vector representation per subword, and then use them in a BERT-style model for morpho-syntactic tasks. 
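To make the two variants concrete, the sketch below shows how per-word triplet codes can be flattened into either vocabulary. It is a hypothetical helper that assumes the (R, G, B) codes have already been produced by the Factorizer codebook, and the offset for special tokens is illustrative rather than the released API.

    def flatten_codes(word_codes, shared_range=False, n_special=2):
        # word_codes: list of (r, g, b) triplets, one per word, each code in 0-255.
        # shared_range=False -> Factorizer 794 style: R, G, B occupy disjoint ID ranges.
        # shared_range=True  -> Factorizer 258 style: all positions share one 0-255 range
        #                       and the model must infer position from token order.
        ids = []
        for r, g, b in word_codes:
            if shared_range:
                ids += [n_special + r, n_special + g, n_special + b]
            else:
                ids += [n_special + r, n_special + 256 + g, n_special + 512 + b]
        return ids

    # Using the codes quoted above for "melon", [30, 255, 209]:
    # flatten_codes([(30, 255, 209)])                     -> [32, 513, 723]
    # flatten_codes([(30, 255, 209)], shared_range=True)  -> [32, 257, 211]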
We merely borrow their autoencoding codebook to discretize text in the same way as a BPE tokenizer would. Their original vocabulary size is 256 x 3 (one each for RGB) equivalent to 768 unique tokens. Another alternative we try is to keep the vocabulary size 256 and let the model’s positional encodings learn patterns that inform whether a given code represents the R, the G, or the B part of a token’s representation. We use both variants in our experiments, distinguishing them by the size of their vocabulary as Factorizer 794 and Factorizer 2583 . They correspond nearly perfectly to the vocabulary sizes of our baselines: BPE (794) and Bytes (256). 2Unlike the RGB continuous spectrum, here [0, 1, 2] may have more in common with [39, 40, 41] than with [1, 2, 3]. 3Corresponding to 768 and 256 respectively but with a few additional special tokens to denote [𝐵𝑂𝑆], [𝐸𝑂𝑆], etc. 89 Figure 8.2: Pictorial depiction of how the Factorizer (Samuel and Øvrelid, 2023) learns token embeddings as an autoencoder (seen here reconstructing the word ‘do’) where the final summed embeddings of the word are used to evaluate on syntactic tasks. We specifically borrow these intermediate codes labelled Factorizer 258 and Factorizer 794 in our chapter as stand-in replacements for a BPE tokenizer, enabling fair comparison on NMT. 90 Language Pair Dataset Type Versions # Sentences Size (MBs) # Chars/Sentence French-English Europarl Training v7 2,002,756 647.69 Fr-166.69; En-147.66 (Fr-En) News Commentary Training v16 365,510 116.05 Newstest Development 2010 2,489 0.71 Fr-147.53 ; En-130.88 Newstest Test 2011 3,003 0.85 Fr-141.48 ; En-126.0 Newstest Test 2012 3,003 0.82 Fr-146.67 ; En-131.06 Newstest Test 2013 3,003 0.72 Fr-126.41 ; En-109.98 German-English Europarl Training v10 1,817,758 585.08 De-167.45 ; En-147.06 (De-En) News Commentary Training v16 388,482 120.34 Newstest Development 2017 3,004 0.71 De-122.04 ; En-111.07 Newstest Test 2018 2,998 0.74 De-107.27 ; En-101.98 Newstest Test 2019 2,000 0.43 De-126.66 ; En-116.22 Newstest Test 2020 785 0.43 De-282.84 ; En-263.92 Spanish-English Europarl Training v7 1,960,641 619.08 Es-161.68 ; En-147.58 (Es-En) News Commentary Training v16 369,540 114.09 Newstest Development 2010 2,489 0.69 Es-142.36 ; En-130.88 Newstest Test 2011 3,003 0.83 Es-140.73 ; En-131.06 Newstest Test 2012 3,003 0.81 Es-123.09 ; En-109.98 Newstest Test 2013 3,003 0.71 Es-138.57 ; En-126.0 English-Arabic Flores200 Training v1 997 0.33 En-289.44 ; Ar - 353.62 (En-Ar) News Commentary Training v16 140,929 132.74 UN Test Development v1 4,000 1.79 En-175.36 ; Ar - 148.38 Flores200 devtest Test v1 1,012 0.34 En-130.4 ; Ar-114.93 Spanish-Arabic Flores200 Training v1 997 0.36 Es-335.49 ; Ar-351.81 (Es-Ar) News Commentary Training v16 132,616 130.82 UN Test Development v1 4,000 1.9 Es-200.63 ; Ar-148.38 Flores200 devtest Test v1 1,012 0.37 Es-155.14 ; Ar-114.93 French-Arabic Flores200 dev Training v1 997 0.35 Fr-345.85 ; Ar-354.56 (Fr-Ar) News Commentary Training v16 104,009 105.57 UN Test Development v1 4,000 1.91 Fr-198.43 ; Ar-148.38 Flores200 devtest Test v1 1,012 0.38 Fr-155.77 ; Ar-114.93 Czech-Arabic News Commentary Training v16 71,080 69.18 Cs-281.97; Ar-359.9 (Cs-Ar) Flores200 dev Development v1 997 0.34 Cs-122.18 ; Ar-110.91 Flores200 devtest Test v1 1,012 0.35 Cs-125.75 ; Ar-114.93 Czech-English Europarl Training v10 644,426 192.45 Cs-134.62 ; En-142.27 (Cs-En) News Commentary Training v16 253,266 72.54 Flores200 dev Development v1 997 0.26 Cs-122.18 ; En-125.57 Flores200 devtest Test v1 1,012 
0.28 Cs-125.75 ; En-130.4 Czech-French News Commentary Training v16 200,137 65.22 Cs-135.02 ; Fr-165.18 (Cs-Fr) Flores200 dev Development v1 997 0.3 Cs-122.18 ; Fr-149.24 Flores200 devtest Test v1 1,012 0.31 Cs-125.75 ; Fr-155.77 Czech-Spanish News Commentary Training v16 222,682 71.05 Cs-136.24 ; Es-160.8 (Cs-Es) Flores200 dev Development v1 997 0.29 Cs-122.18 ; Es-149.64 Flores200 devtest Test v1 1,012 0.3 Cs-125.75 ; Es-155.14 Table 8.2: Summary of our Training, Development, and Test Datasets on ten language pairs. 91 8.3 Experiment Setup Our primary research question is to evaluate a learnt Factorizer vocabulary with BPE subwords. We operationalize this in the form of a neural machine translation experiment to compare different tokenizers where the same model is trained from scratch on the same dataset for the same number of epochs with the same optimizer configuration. Model Our base model is a 6 layer transformer encoder-decoder (Vaswani et al., 2017) that has 8 attention heads, 512 hidden vector units, and a feed forward intermediate size of 2048, with GeLU activation (Hendrycks and Gimpel, 2023). We use label smoothing at 0.1, and a dropout rate of 0.1. We use the RTG 4 library for model implementation and an extended version of NLCodec library (Gowda et al., 2021b) for tokenization. Datasets: We use a variety of machine translation datasets in our experiments, preprocessed with the Moses tokenizer (Koehn et al., 2007b). For each language pair, we summarize our training, development, and test sets in Table 8.2, each based on the following source: 1. Europarl Corpus: Originating from the European Parliament proceedings, this multilingual dataset is focused on political and legislative language (Koehn, 2005). 2. News Commentary Corpus: This corpus includes multilingual news commentary articles, with exposure to current events and journalistic language (Tiedemann, 2009). 3. WMT Newstest Sets: Part of the annual Workshop on Machine Translation evaluation, these news article sets are used for benchmarking translation system performance (Kocmi et al., 2022). 4https://github.com/isi-nlp/rtg 92 4. Flores Benchmark: Designed for evaluating translation in low-resource languages, Flores includes a broad domain range, improving model versatility (NLLB Team et al., 2022). 5. United Nations (UN) Test Sets: Derived from official UN documents, this dataset introduces models to complex diplomatic and international terminology (Ziemski et al., 2016). 
Factorizer 794 BPE 794 Byte 258 Factorizer 258 BLEU chrF BLEU chrF BLEU chrF BLEU chrF 𝐸𝑛 → 𝐷𝑒 22.4 ± 4.4 53.4 ± 3.0 22.7 ± 4.6 54.4 ± 3.2 25.2 ± 5.2 55.6 ± 3.4 20.8 ± 4.0 52.2 ± 2.9 𝐸𝑛 → 𝐹𝑟 22.4 ± 0.7 53.7 ± 1.0 21.6 ± 2.2 53.1 ± 2.3 25.1 ± 0.7 56.0 ± 0.9 24.0 ± 0.7 52.7 ± 1.0 𝐸𝑛 → 𝐸𝑠 28.0 ± 1.5 54.8 ± 1.3 29.3 ± 1.5 56.2 ± 1.3 32.1 ± 1.8 56.9 ± 1.6 27.9 ± 1.5 54.1 ± 1.3 𝐸𝑛 → 𝐶𝑠 19.3 ± 0.1 48.0 ± 0.2 19.9 ± 0.2 49.0 ± 0.1 22.2 ± 0.1 50.4 ± 0.1 18.9 ± 0.3 47.3 ± 0.1 𝐸𝑛 → 𝑥𝑥 23.0 52.5 23.4 53.2 26.2 54.8 22.9 51.6 𝐴𝑟 → 𝐸𝑛 20.5 ± 0.3 48.5 ± 0.3 22.2 ± 0.1 49.8 ± 0.5 21.2 ± 0.7 48.2 ± 0.3 17.7 ± 0.1 45.0 ± 0.2 𝐴𝑟 → 𝐹𝑟 13.9 ± 0.5 42.4 ± 0.1 15.0 ± 0.3 44.1 ± 0.1 11.2 ± 0.8 38.7 ± 0.7 11.1 ± 0.1 38.6 ± 0.1 𝐴𝑟 → 𝐸𝑠 12.6 ± 0.3 39.7 ± 0.3 13.2 ± 0.1 40.9 ± 0.1 4.9 ± 3.3 27.3 ± 6.2 10.5 ± 0.2 37.4 ± 0.2 𝐴𝑟 → 𝐶𝑠 6.3 ± 0.1 30.0 ± 0.2 6.4 ± 0.2 31.3 ± 0.1 4.4 ± 0.1 25.7 ± 0.1 4.3 ± 0.2 26.7 ± 0.2 𝐴𝑟 → 𝑥𝑥 13.3 40.2 14.2 41.5 10.4 35.0 10.9 36.9 𝐶𝑠 → 𝐸𝑛 25.2 ± 0.2 54.1 ± 0.1 26.4 ± 0.1 55.6 ± 0.2 27.0 ± 0.3 55.8 ± 0.2 23.8 ± 0.1 53.1 ± 0.1 𝐶𝑠 → 𝐴𝑟 4.2 ± 0.2 31.3 ± 0.1 4.7 ± 0.3 33.5 ± 0.2 4.5 ± 0.1 32.8 ± 0.1 3.6 ± 0.2 29.8 ± 0.1 𝐶𝑠 → 𝐹𝑟 16.6 ± 0.1 45.3 ± 0.2 17.5 ± 0.2 46.6 ± 0.2 18.5 ± 0.1 47.5 ± 0.2 14.8 ± 0.1 43.5 ± 0.1 𝐶𝑠 → 𝐸𝑠 13.4 ± 0.1 41.6 ± 0.1 14.0 ± 0.2 42.5 ± 0.2 15.2 ± 0.1 43.0 ± 0.3 12.3 ± 0.3 40.2 ± 0.1 𝐶𝑠 → 𝑥𝑥 14.9 43.1 15.7 44.6 16.3 44.8 13.6 41.7 Table 8.3: Comparison of different source tokenizers with the target fixed (xx → BPE-8K) across 12 language pairs, along with standard deviations over 3 runs with different random seeds. English source experiments are averaged over three different test sets, resulting in higher variance. We also report (micro) averages grouped by source language. Takeaway: Factorizer does not outperform BPE but is better than Bytes when translating Arabic. Training and Evaluation We use the Adam optimizer (Kingma and Ba, 2017) with a controlled learning rate that warms up for 16K steps followed by a decay rate recommended for training 93 transformer models. Each model is trained from scratch, and the hyperparameters (per language pair) are chosen by grid search to optimize the baseline validation BLEU. We train all models for up to 100, 000 steps (early stop by development loss with a patience of 5) with batch size 24, 000. We report sacreBLEU (Post, 2018b) and chrF (𝛽 = 2) scores (Popović, 2015). As is common in machine translation experiments, our models do not share source and target vocabularies. In most experiments below, we further isolate the effects of tokenization to a single side (source or target) while fixing the other side to be the default baseline with 8, 000 BPE tokens. Doing so at the target side has the added advantage that the autoregressive decoding speed at inference is unaffected by the source vocabulary, which is one of the prominent critiques against, say, byte-level models. 8.4 Results The purpose of this work is to compare traditionally used tokenizers like Byte and BPE subwords to the learnt tokenizers: Factorizer 258 and Factorizer 794. We break down our results into the following research questions: 8.4.1 How well do learnt tokenizers encode source and decode target text? We first experiment with different source-side tokenizers while keeping the target side as BPE 8K. Table 8.3 shows that Factorizer (794) does not outperform BPE but is better than Bytes when translating Arabic to other languages. 
We theorize that the Bytes tokenizer does relatively better on English primarily due to how UTF-8 encodes each Latin alphabet with a single byte each, whereas Arabic alphabets require two bytes each. 94 Figure 8.3: BLEU scores on target side with the source side fixed as (xx ← BPE-8K) across six language pairs. BPE consistently outperforms Factorizer. Based on the above results, we further experiment with the two best tokenizers BPE 794 and Factorizer 794 at target-side in machine translation. The smaller vocabulary Byte and Factorizer 258 tokenizers are also particularly slow at inference, since they must autoregressively decode more number of times for the same sentence than BPE 794 and Factorizer 794. Figure 8.3 shows again that while Factorizer performs competitively with BPE, it is unable to beat it for any of the six language pairs. In the following sections, we perform further ablations primarily on the Arabic-English translation task, since Factorizer shows relative promise in encoding Arabic. Moreover, the Ar → En task helps us qualitatively analyze model outputs in English (Section 8.4.4). 95 Figure 8.4: Data Scarcity: BLEU scores over Ar → En with different source-side tokenizers (targetside fixed at BPE 8k). Most tokenizers lose performance in a low resource setting but Factorizer 794 gains the most. 8.4.2 How robust are tokenizers to data scarcity? Prior work (Samuel and Øvrelid, 2023) has shown the benefits that alternative tokenizers have when training with low resources. Here, we evaluate the relative drop in performance of our models when trained on lower resources. More specifically, we experiment with Arabic → English translation where the training set is now UN Test (4,000 examples) and the development set is Flores 200 (997 examples). In the high resource setting, the total training set had 141,926 examples and the development set had 4,000 examples. For fair comparison, our test set in both settings is Flores 200 devtest (1,012 examples). Figure 8.4 reports BLEU scores when comparing different source-side tokenizers, keeping target-side tokenizer fixed at our default BPE 8k. We find that while most tokenizers lose some score in the low resource setting, Factorizer 794 on the contrary gains the most, demonstrating better robustness to data scarcity. 96 Figure 8.5: Ar→En relative BLEU scores (100 denotes noiseless5 ) with varying degrees of noise added to the test source sentences. Factorizer performance relatively degrades less than BPE as noise increases. 8.4.3 How robust are tokenizers to noise? Following Samuel and Øvrelid (2023) we experiment with adding different degrees of artificial noise in our Arabic→English experiments with BPE 794-BPE 794 and Factorizer 794-Factorizer 794 5 . We add, remove, or replace each non-space character with a certain probability in the test set source sentences (Arabic); the training set remains uncorrupted in each case. In line with previous work, Figure 8.5 find that Factorizer performance relatively degrades less than BPE as noise increases. 5The noiseless BLEU scores are respectively 23.4 and 20.1 (in line with above results). 97 Length Factorizer-794 BPE-794 <10 17.33 10.73 [10,20) 15.06 16.65 [20,30) 17.45 18.63 [30,40) 20.22 19.30 [40,50) 18.62 19.43 [50,60) 17.98 19.58 >=60 45.30 33.16 Table 8.4: BLEU scores on Arabic → English stratified by lengths. Factorizer particularly outperforms when the reference is either very short or very long. 8.4.4 Do different tokenizers specialize in different kinds of translations? 
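For reference, the character-level corruption used in Section 8.4.3 can be sketched as follows; the uniform choice among the three operations and the Arabic letter set used for insertions and replacements are illustrative assumptions, not the exact noising script.

    import random

    ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"

    def corrupt(sentence, p=0.05, alphabet=ARABIC_LETTERS):
        # Independently add, remove, or replace each non-space character with probability p.
        out = []
        for ch in sentence:
            if ch == " " or random.random() > p:
                out.append(ch)
                continue
            op = random.choice(["add", "remove", "replace"])
            if op == "add":
                out.append(ch)
                out.append(random.choice(alphabet))
            elif op == "replace":
                out.append(random.choice(alphabet))
            # "remove": drop the character entirely
        return "".join(out)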
We note in Table 8.3 how Byte-tokenized models work better for Latin scripts than non-Latin ones. This can be possibly explained by the inherent bias within UTF-8 encoding scheme which yields a single byte to all Latin characters but as many as three bytes per character for languages that appear later in the Basic Multilingual Plane (BMP). Here, we ask similarly what other factors may influence the performance of a tokenizer in machine translation. We use the Compare-MT (Neubig et al., 2019) library to stratify results according to source length, target length, frequency of words, presence of key phrases, and other dimensions. Table 8.4 depicts a stratification by length of target reference. We find that Factorizer significantly outperforms BPE on very short and very long translations, by as much as 70%. Table 8.5 also highlights such representative samples from the test set of our Arabic → English experiments. 98 8.4.5 Can we quantify the morphological preference of tokenizers? Our experiments show that relatively, Factorizers perform better on Arabic than say, English. We note in Figure 8.1 how the non-concatenative morphology of Arabic may be a factor behind this result. In this subsection, we further quantify this intuition. We test the hypothesis of whether BPE and Factorizer are separately suited to be better at different kinds of morphologies. To this end, we cluster the top 10,000 words in both Arabic and English by their root form (Sylak-Glassman, 2016; van der Zwaan et al., 2019), e.g., the root form have maps to the following common words: have, has, had, having. Next, we tokenize each such word using the two tokenizers (BPE 794 and Factorizer 794), and count the subset of encoding that is ‘most representative’ of the root cluster. We define representativeness here as the fraction of words that share this code within this cluster. For example, if two of the above four forms of the root have include a code ha## and six other English words also include this code, then the representativeness score for this cluster in BPE is 2 8 = 0.25. We plot the histograms of representativeness scores over 1,410 English roots and 73 Arabic ones in Figures 8.6 and 8.7. Distributions that are shifted towards the right side on the X-axis indicate a more representative code that captures root forms. We observe that while BPE subwords are better suited to the concatenative morphology of English, Arabic root forms that share nonconcatenative morphological features are better encapsulated by the learnt codes in Factorizer (blue distribution leans more to the right, i.e., higher representativeness). 99 Figure 8.6: Representativeness in English. BPE 794 codes well represent more root forms than Factorizer 794 (rightwards is better). See Section 8.4.5 for details. Figure 8.7: Representativeness in Arabic. Factorizer 794 codes well represent more root forms than BPE 794 (rightwards is better). See Section 8.4.5 for details. 8.5 Related Work Some recent work has challenged subword tokenization schemes. Table 8.1 highlights the different kinds of alternative tokenizations existing in prior work and why this chapter works with the Factorizer, the only tokenizer that controls for all dimensions and makes it possible to compare directly against a subword vocabulary. 
This section summarizes the different efforts by the community towards alterantive tokenization: Character/Byte-level ByT5 (Xue et al., 2022), CANINE (Clark et al., 2022), and SubChar (Si et al., 2021) propose using very small fixed-length units such as characters, bytes, or glyph strokes instead of dynamic-length subwords or words. This often comes at the expense of larger sequence lengths and more compute requirements, especially for a transformer architecture which typically has a complexity of O (𝑛 2 ) in number of input tokens. Edman et al. (2023) investigate byte and subword-level models for machine translation. 100 Beyond word level CodeBPE (Chirkova and Troshin, 2022) and Multi Word Expressions (Kumar and Thawani, 2022; Zaninello and Birch, 2020; Rikters and Bojar, 2017) show promise in yet larger tokens that cross word boundaries, e.g., a vocabulary with single tokens for the strings “for i in range” or “New York City” respectively. Learnt subword segmentation Some methods (Mofijul Islam et al., 2022; Kaushal and Mahowald, 2022; Pinter et al., 2021; Tay et al., 2021; Provilkov et al., 2020; Wang et al., 2021) parameterize the process of segmentation by pooling character n-grams or sampling one of the many ways to segment a given word. In contrast, we are interested in a different rearrangement of the vocabulary that does not segment words at the surface level alone. Domain specific tokenization Several domains have benefited from a custom tokenization strategy (Dagan et al., 2024). Numbers are often inconsistently segmented into subwords, leading to decreased arithmetic (Wallace et al., 2019) and estimation (Thawani et al., 2021c) skills. The extent of these numeric limitations is so dire that GPT-4 (OpenAI, 2023) has an explicit workaround of adding all numbers from 0 to 999 as individual tokens to the model’s vocabulary. Boecking et al. (2022a) train a better tokenizer for the biomedical domain and Dagan et al. (2024) perform a similar analysis over code language models. 8.6 Discussion Subword tokenization is a heuristic to find contiguous pieces of characters that occur frequently, e.g., prefixes (dis-) and suffixes (-ing). However, natural language includes many more diverse patterns involving longer range dependencies, e.g., non-concatenative morphology in Arabic 101 (Figure 1). A more expressive method to find such dependencies is to learn a vector-quantized codebook of tokens from raw bytes. We evaluate such learnt tokenizers on the task of machine translation across six language pairs and find that while they do not outperform subwords in general, they are more robust to misspellings and better on very short and very long sentences (by as much as 70%). We also demonstrate why they have a preference for representing nonconcatenative morphologies. In conclusion, our study explored the impact of tokenization schemes on neural machine translation performance by comparing traditional Byte Pair Encoding (BPE) with a recent, learned tokenizer known as Factorizer. Our experiments, conducted across six language pairs, revealed that while BPE continues to hold its ground as the superior tokenizer in most scenarios, Factorizer shows promise, particularly when translating from Arabic. Notably, Factorizer outperformed BPE in translating very short and very long sentences, indicating its potential in handling edge cases effectively. 
In conclusion, our study explored the impact of tokenization schemes on neural machine translation performance by comparing traditional Byte Pair Encoding (BPE) with a recent, learned tokenizer known as Factorizer. Our experiments, conducted across six language pairs, revealed that while BPE continues to hold its ground as the superior tokenizer in most scenarios, Factorizer shows promise, particularly when translating from Arabic. Notably, Factorizer outperformed BPE in translating very short and very long sentences, indicating its potential in handling edge cases effectively.

We rigorously analyze one of the factors behind this relative preference of BPE for inflectional morphologies like English and of Factorizer for non-concatenative morphologies like Arabic. We find that learnt codebooks better represent the non-concatenative root forms in Arabic than subword heuristics do (Figure 8.7).

Our findings underscore the importance of continuing to explore and refine tokenization techniques in the field of neural machine translation. While BPE remains a strong baseline, the potential for improvement with learned tokenizers like Factorizer warrants further investigation, particularly in language pairs and scenarios where traditional methods may falter.

Reference: The harbor was the site of an infamous naval standoff in 1889 when seven ships from Germany, the US, and Britain refused to leave the harbor.
Factorizer (SentBLEU 55.20): The facility was the site of a notorious sea-lane confrontation in a little-noticed year when seven ships from Germany, the US, and Britain refused to leave the air.
BPE (SentBLEU 14.94): Seven ships from Germany, the United States, and Britain refused to leave.

Reference: The Internet combines elements of both mass and interpersonal communication.
Factorizer (SentBLEU 80.50): The Internet combines elements of both mass and private communication.
BPE (SentBLEU 26.78): The Internet brings together elements of both public and personal communication.

Reference: Argentina is well known for having one of the best polo teams and players in the world.
Factorizer (SentBLEU 52.86): Argentina is famous for having one of the best teams and Buddhist players in the world.
BPE (SentBLEU 17.40): Argentina is notorious for the existence of one of the world ’ s best statesmen.

Reference: Christmas is one of the most important holidays of Christianity, and is celebrated as the birthday of Jesus.
Factorizer (SentBLEU 23.41): Christmas is one of Christianity ’ s most important Christmas habits, celebrated as Christmas.
BPE (SentBLEU 76.83): Christmas is one of the most important holidays of Christianity, and is celebrated as Christmas ’s birthday.

Reference: As knowledge of Greek declined, the West found itself cut off from its Greek philosophical and scientific roots.
Factorizer (SentBLEU 13.80): While knowledge has declined in Greeks, the West has found itself insulated from its philosophical roots and Greek science.
BPE (SentBLEU 42.68): As Greek knowledge declined, the West found itself isolated from its philosophical and scientific roots.

Reference: A couple may decide it is not in their best interest, or in the interest of their child, to raise a baby.
Factorizer (SentBLEU 10.37): She may decide that she is neither good nor in her child ’ s interest to rank a baby.
BPE (SentBLEU 60.26): uan may decide that it is not in their interest, or in the interest of their child, to have a baby.

Table 8.5: Representative samples of Arabic → English translations - three examples each of where Factorizer significantly outperforms BPE and vice versa (as measured by Sentence BLEU). We highlight the winning system’s successes and failures.
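Per-sentence scores of the kind reported in Table 8.5 can be computed with a sentence-level BLEU implementation such as sacreBLEU (Post, 2018a). The following is a minimal sketch; the exact values depend on the tokenization and smoothing settings, so it may not reproduce the table's configuration exactly.

```python
import sacrebleu

ref = "The Internet combines elements of both mass and interpersonal communication."
hyps = {
    "Factorizer": "The Internet combines elements of both mass and private communication.",
    "BPE": "The Internet brings together elements of both public and personal communication.",
}

for name, hyp in hyps.items():
    # sentence_bleu smooths n-gram counts so single sentences get usable scores.
    score = sacrebleu.sentence_bleu(hyp, [ref])
    print(f"{name}: {score.score:.2f}")
```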
Chapter 9

Conclusion

This dissertation has explored the limitations of subword tokenization, a crucial yet often overlooked component in the language modeling pipeline. We demonstrated that subwords, while effective for many NLP tasks, struggle to capture the nuances of numbers, multi-word expressions, and non-concatenative languages. To address these limitations, we proposed and evaluated a novel approach: aggregating symbols based on their inherent meaning and relationships.

We demonstrated that aggregating symbols on the number line can significantly enhance both the numeracy and literacy of language models. This led to improved performance on tasks such as masked number prediction, numerical fact estimation, and even word prediction in non-numeric contexts. Furthermore, we showed that learning to aggregate symbols through an end-to-end tokenized language model can outperform both subword and character-level tokenization, leading to significant gains in language modeling capabilities across multiple languages and datasets. Finally, we investigated the downstream effects of learned tokenizers on machine translation, highlighting their robustness to noise and data scarcity, particularly in the context of non-concatenative languages.

9.1 Future Work

This thesis only scratches the surface of the vast potential of aggregating symbols for language modeling. Several promising directions remain open; we outline some of them in the following sections.

9.1.1 Downstream effects: Financial Language Modeling

The financial domain presents a unique and challenging application for natural language processing. Its lexicon is filled with specialized jargon, numerical data abounds, and the underlying semantic relationships are often intricate and context-dependent. Understanding financial language requires not only deciphering technical terms and interpreting numerical information but also grasping the subtle nuances of sentiment, risk, and market dynamics.

Large language models (LLMs) have shown promise in tackling a variety of financial NLP tasks, from sentiment analysis and risk assessment to portfolio optimization and fraud detection. However, their success is heavily reliant on the foundational step of tokenization, which determines how financial text is represented and processed by the model. As we have explored in previous chapters, traditional subword tokenization methods, while efficient, often fall short in capturing the complexities of financial language, particularly when it comes to representing numbers and domain-specific vocabulary.

This section considers the downstream effects of alternative tokenization strategies on financial language modeling, specifically focusing on the task of predicting stock market performance. We investigate whether tokenizers that go beyond subword segmentation and learn to aggregate symbols based on their inherent meaning and relationships can enhance the ability of LLMs to extract valuable insights from financial documents and accurately forecast stock prices.

The motivation for exploring alternative tokenization strategies in financial language modeling stems from the inherent limitations of subword tokenization in handling:

1. Financial numeracy: Financial documents are rife with numerical data, from stock prices and trading volumes to financial ratios and economic indicators. Accurately representing and interpreting this numerical information is crucial for understanding market trends and making informed investment decisions. However, as we have seen in Chapters 3 and 4, subword tokenization often leads to inconsistent and suboptimal representations of numbers (see the sketch after this list), potentially hindering the ability of LLMs to grasp the quantitative aspects of financial language.

2. Domain-specific terminology: The financial domain employs a vast lexicon of specialized jargon and technical terms, many of which are rarely encountered in general language corpora. Subword tokenization, trained on massive but predominantly general-domain datasets, may struggle to adequately represent these domain-specific terms, leading to suboptimal semantic representations.

3. Long-range dependencies: Financial narratives often involve complex causal relationships and long-range dependencies between events, concepts, and numerical data points. Subword tokenization, primarily focused on local character sequences, may fail to capture these intricate connections, limiting the model’s ability to grasp the full context and understand the underlying market dynamics.
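To make the numeracy concern of item 1 above concrete, the following illustrates how an off-the-shelf subword vocabulary fragments financial quantities. The choice of GPT-2's tokenizer and the example sentence are arbitrary illustrations rather than the setup of any specific experiment in this thesis, and the sketch assumes the Hugging Face transformers library is installed.

```python
from transformers import AutoTokenizer

# GPT-2's BPE vocabulary as a representative off-the-shelf subword tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "Revenue grew from $9,998.5 million to $10,002.3 million."
print(tokenizer.tokenize(sentence))
# The two nearly identical amounts are typically split into different,
# unrelated pieces (the exact split depends on the vocabulary), so the surface
# tokens expose little of the shared numerical structure between them.
```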
By leveraging tokenization strategies that learn to aggregate symbols based on their meaning and relationships within the financial domain, we hypothesize that LLMs can develop a more nuanced understanding of financial language and improve their ability to predict stock market performance.

There exist several data sources for evaluating the tokenization of language models on financial documents:

1. SEC filings (10-K and 10-Q reports): These annual and quarterly reports filed by publicly traded companies contain comprehensive information about their financial performance, business operations, and risk factors.

2. Earnings call transcripts: Quarterly earnings calls provide insights into a company’s recent performance and future outlook, often accompanied by management commentary and analyst questions.

Recent studies have shown promise in using LLMs for tasks such as deciphering financial jargon (Hansen and Kazinnik, 2023), analyzing earnings call transcripts (Chin and Fan, 2023), and assessing the information content of financial news (Lopez-Lira, 2023). Our work specifically focuses on the impact of tokenization on financial language modeling, particularly in the context of predicting stock market performance. Previous research has highlighted the limitations of LLMs when relying solely on historical numerical data for prediction tasks (Xie et al., 2023; Ko and Lee, 2023), suggesting the need for more nuanced approaches that effectively incorporate textual information. Furthermore, the development of specialized financial LLMs like BloombergGPT (Wu et al., 2023) underscores the importance of tailoring language models to the specific needs and intricacies of the financial domain.

9.1.2 Interpretability: Aggregation of Symbols inside LLMs

The impressive capabilities of large language models (LLMs) often lead to a black-box perception, where their internal workings remain relatively unknown to the end user. While their outputs are often remarkable, understanding how LLMs arrive at those outputs is crucial for building trust and ensuring responsible use. This section explores a promising avenue for interpreting LLM behavior: analyzing how LLMs aggregate symbols internally, particularly within the context of attention.

Recent research has shed light on the memory-like nature of attention in transformers. Geva et al. (2021) demonstrated that transformer feed-forward layers essentially serve as key-value memories, storing information related to specific concepts or patterns. Moreover, Dai et al. (2021) showed the potential of identifying and manipulating “knowledge neurons” within LLMs, allowing for targeted updates to the model’s factual knowledge. This suggests that LLMs might be aggregating symbols in a way that resembles how humans categorize and store knowledge.
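As a toy illustration of the key-value-memory reading of Geva et al. (2021): the first feed-forward matrix can be viewed as a set of ‘keys’ matched against the hidden state, and the second as the ‘values’ written back to the residual stream when a key fires. The sizes, random weights, and ReLU below are placeholders, not a real trained transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                       # illustrative sizes

keys = rng.normal(size=(d_ff, d_model))     # rows act as memory keys
values = rng.normal(size=(d_ff, d_model))   # rows act as memory values

def feed_forward(x):
    """One position's feed-forward sublayer, read as a key-value memory:
    match x against every key, then mix the corresponding values."""
    activations = np.maximum(keys @ x, 0.0)  # how strongly each key fires
    return activations @ values              # weighted sum of the values

x = rng.normal(size=d_model)                 # hidden state for one token
firing = np.maximum(keys @ x, 0.0)
print("most active memory slots:", np.argsort(firing)[-3:][::-1])
print("update to the residual stream:", feed_forward(x).round(2))
```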
The current thesis focuses on aggregating external symbols as they enter a language model and as they appear in its output space. This insight, however, opens exciting possibilities for interpreting, and potentially even editing, the aggregation of symbols within language models, particularly in the key-query-value attention networks. Such capabilities would open avenues for improving LLM accuracy, reducing biases, and enhancing their generalizability across diverse domains. Moreover, this knowledge can be instrumental in developing methods for interpreting, debugging, and even editing the knowledge stored within these powerful models.

This thesis attempts to make tokenization learnt yet interpretable, e.g., by aggregating learnt codes to represent surface-level tokens. For instance, when we train eByte embeddings (Chapter 7) on our number-heavy Wiki-Convert dataset (Chapter 3), we found interesting patterns in Factorizer-like (Chapter 8) codes. The codes here refer to codebook indices from the learnt eByte method. See Table 9.1 for some patterns: e.g., several years of the same era share similar codes, and some upper/lowercase word variants also share codes. We also found that subsets of codes are representative of common patterns; for example, the first codebook index being ‘19’ predicts that a token is numeric with an F1 score of 0.53. We refer the interested reader to Section 4.4 for similar neuron probing experiments.

Code (99, 52, 22, 34): ‘1919’, ‘1959’, ‘1957’, ‘1968’, ‘1949’, ‘1937’, ‘1967’, ‘1958’, ‘1978’, ‘1927’, ‘1956’, ‘1917’, ‘1908’, ‘1997’, ‘1939’, ‘1909’, ‘1979’, ‘1948’, ‘1969’, ‘1799’, ‘1918’, ‘1977’, ‘1987’, ‘1879’, ‘1976’, ‘1907’, ‘1947’, ‘1819’, ‘1839’, ‘1986’, ‘1996’, ‘1966’, ‘4916’
Code (9, 19, 68, 72): ‘$2’, ‘$20’, ‘$30’, ‘$40’, ‘$50’, ‘$5’
Code (34, 34, 76, 36): ‘kilometers’, ‘kilometres’
Code (25, 2, 6, 89): ‘east–west’, ‘west–east’
Code (90, 67, 57, 77): ‘following’, ‘Following’

Table 9.1: Patterns found in tokens that share similar codes, trained with eByte over Wiki-Convert.
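The pattern-finding analysis behind Table 9.1 amounts to inverting the code assignment: group surface tokens by the tuple of codebook indices they receive and inspect the resulting groups. A minimal sketch follows; the hard-coded toy codes merely mimic the table, whereas the real analysis would query the trained eByte/Factorizer model.

```python
from collections import defaultdict

def group_tokens_by_code(tokens, encode_to_codes):
    """Map each code tuple to the surface tokens that share it, so recurring
    patterns (years, currency amounts, case variants) become visible."""
    groups = defaultdict(list)
    for token in tokens:
        groups[tuple(encode_to_codes(token))].append(token)
    return {code: toks for code, toks in groups.items() if len(toks) > 1}

# Hard-coded toy codes mimicking Table 9.1; a real run would call the model.
toy_codes = {
    "1919": (99, 52, 22, 34), "1959": (99, 52, 22, 34),
    "$20": (9, 19, 68, 72), "$50": (9, 19, 68, 72),
    "following": (90, 67, 57, 77), "Following": (90, 67, 57, 77),
    "Paris": (3, 14, 15, 92),
}
for code, toks in group_tokens_by_code(toy_codes, toy_codes.get).items():
    print(code, toks)
```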
9.2 Limitations

There exist several limitations to fundamental research on tokenization such as the one described in this thesis. First and foremost, the bitter lesson (Sutton, 2019) of machine learning suggests that several shortcomings of current language models can be overcome not by improving tokenization but by increasing training data. On similar lines, recent large language models like GPT-4o [1] have incorporated even larger vocabulary sizes of 200,000 tokens. However, we note that such approaches only appear to be brute force; in practice they still require custom logic, e.g., all Arabic numbers from 0-1000, and all Devanagari and Chinese numbers from 0-100, have been explicitly added to its vocabulary. Other similar adjustments have recently been made for Japanese [2].

While the first half of this thesis presents numeracy and multi-word expressions as evidence of current shortcomings in tokenization, the latter half instead converges on learnt tokenizers as a general-purpose solution. We acknowledge that such codebook-learned tokenizers have several shortcomings: they are not as directly interpretable as subwords; they need to be trained on a corpus (though so do subword tokenizers) and cannot be plugged into a pretrained language model; and they lack the inductive bias that characters appearing close together may form coherent units. Moreover, we found it challenging to reproduce the efficacy of the Factorizer tokenizers pretrained by Samuel and Øvrelid (2023). We believe this to be caused by our limited compute and limited training data availability.

This thesis systematically and empirically analyses the effects of aggregating symbols in language models. We conduct downstream experiments, for example on Fermi problems and machine translation, but we acknowledge that we do not exhaustively demonstrate the applicability of our methods to all NLP tasks.

[1] https://openai.com/index/hello-gpt-4o/
[2] https://openai.com/index/introducing-openai-japan/

Chapter 10

Ethical Considerations

We now discuss several ethical considerations which were taken into account for the work presented in this dissertation. This includes information about our data collection procedure and its public release, the precautions taken for experiments that involved human-annotated data, and details regarding the external datasets and LLMs used in our experiments. We conclude by providing general recommendations for ethically sound progress on tokenization in the future.

Datasets Used: All the datasets used for the experiments presented throughout the manuscript had been completely anonymized before their release by the respective authors. We conducted a meticulous review of the licensing details for each dataset to ensure that our usage strictly adheres to their intended purposes and scope.

LLMs Used: Our analysis with Large Language Models (LLMs) is strictly within the intended scope in accordance with the respective licensing details of the released models. Our approach is consistent with various other recent efforts that aim to evaluate the diverse capabilities of LLMs, ensuring that the use remains within ethical and operational guidelines.

Numeracy: This work revolves around the Hindu-Arabic numeral system and English number words, which are not the only number systems still in use today. We encourage follow-up work to take these systems into consideration, on the lines of Johnson et al. (2020) and Nefedov (2020).

Our Dataset: Our dataset, Wiki-Convert, has been extracted from Wikipedia dumps, which are licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License. The authors were in no way involved in the annotation process, which was carried out by volunteer Wiki editors making use of the Convert template. We note, however, that the units of measurement we filter out (due to rare occurrences) will cause a cultural bias towards European and American units, such as pounds or miles, since they are over-represented in English Wikipedia. As a remedy, we shall release extraction scripts to enable researchers to create other versions of Wiki-Convert, perhaps even supporting multiple languages.

Tokenization: We acknowledge that research on tokenization in language models is one of the fundamental steps where language diversity is essential for an equitable outcome in Generative AI. Our work is in part an effort to evaluate tokenizers that make fewer assumptions about the morphology of the underlying language than BPE-like subword segmentation heuristics. We analyze in Section 8.4.5 how the non-concatenative morphology of Arabic may explain why factorizers perform relatively better on it than on English.

Recommendations: The fast pace of AI research has made the development of formal ethical guidelines difficult. This makes responsible actions on the part of the research community all the more crucial. The inclusion of a focused section on ethical considerations in research papers and a separate ethics reviewing committee in *ACL conferences are both steps in the right direction.
However, this is clearly not sufficient since it is nontrivial to foresee all the underlying 112 ethical concerns or the misbehavior of the systems beforehand. Hence, in addition to following software deployment best practices such as shadow deployment and canary release, promoting an opensource culture to the extent possible and incentivizing rigorous red teaming efforts can help mitigate these concerns. 113 Bibliography Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operationbased formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics. Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5947–5952, Hong Kong, China. Association for Computational Linguistics. Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics. Taylor Berg-Kirkpatrick and Daniel Spokoyny. 2020. An empirical investigation of contextualized number prediction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4754–4764, Online. Association for Computational Linguistics. Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. 2022a. Making the most of text semantics to improve biomedical vision–language processing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 1–21. Springer. Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. 2022b. Making the most of text semantics to improve biomedical vision–language processing. In European conference on computer vision, pages 1–21. Springer. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the association for computational linguistics, 5:135–146. 114 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. 
Sarah E Bullard, Deborah Fein, Mary Kay Gleeson, Nita Tischer, Robert L Mapou, and Edith Kaplan. 2004. The biber cognitive estimation test. Archives of clinical neuropsychology, 19(6):835–846. Elie Bursztein, Marina Zhang, Owen Vallis, Xinyu Jia, and Alexey Kurakin. 2023. Retvec: Resilient and efficient text vectorizer. Kris Cao. 2023. What is the best recipe for character-level encoder-only modelling? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5924–5938, Toronto, Canada. Association for Computational Linguistics. Oralie Cattan, Christophe Servan, and Sophie Rosset. 2021. On the Usability of Transformersbased models for a French Question-Answering task. In Recent Advances in Natural Language Processing (RANLP), Varna, Bulgaria. Kushal Chawla, Gale Lucas, Jonathan Gratch, and Jonathan May. 2020. Bert in negotiations: Early prediction of buyer-seller negotiation outcomes. arXiv preprint arXiv:2004.02363. Chung-Chi Chen, Hen-Hsen Huang, and Hsin-Hsi Chen. 2020. Numclaim: Investor’s fine-grained claim detection. In Proceedings of the 29th ACM International Conference on Information amp; Knowledge Management, CIKM ’20, page 1973–1976, New York, NY, USA. Association for Computing Machinery. Chung-Chi Chen, Hen-Hsen Huang, Yow-Ting Shiue, and Hsin-Hsi Chen. 2018. Numeral understanding in financial tweets for fine-grained crowd-based forecasting. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pages 136–143. IEEE. Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura, and Hsin-Hsi Chen. 2019. Numeracy-600K: Learning numeracy for detecting exaggerated information in market comments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6307–6313, Florence, Italy. Association for Computational Linguistics. David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201– 228. Andrew Chin and Yuyu Fan. 2023. Leveraging text mining to extract insights from earnings call transcripts. Journal of Investment Management, 21(1):81–102. Nadezhda Chirkova and Sergey Troshin. 2022. Codebpe: Investigating subtokenization options for large language model pretraining on source code. In Deep Learning for Code Workshop. Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. 2022. Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91. 115 Sara Cordes, Rochel Gelman, Charles R Gallistel, and John Whalen. 2001. Variability signatures distinguish verbal from nonverbal counting for both large and small numbers. Psychonomic bulletin & review, 8(4):698–707. Gautier Dagan, Gabriele Synnaeve, and Baptiste Rozière. 2024. Getting the most out of your tokenizer for pre-training and domain adaptation. Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2021. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696. Stanislas Dehaene. 2011. The number sense: How the mind creates mathematics. OUP USA. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. 
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics. Abhijeet Dubey, Lakshya Kumar, Arpan Somani, Aditya Joshi, and Pushpak Bhattacharyya. 2019. “when numbers matter!”: Detecting sarcasm in numerical portions of text. In Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 72–80, Minneapolis, USA. Association for Computational Linguistics. Lukas Edman, Gabriele Sarti, Antonio Toral, Gertjan van Noord, and Arianna Bisazza. 2023. Are character-level translations worth the wait? comparing character- and subword-level models for machine translation. Pavel Efimov, Andrey Chertok, Leonid Boytsov, and Pavel Braslavski. 2020. Sberquad – russian reading comprehension dataset: Description and analysis. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 3–15. Springer International Publishing. Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Hiroshi Noji, Pierre Zweigenbaum, and Jun’ichi Tsujii. 2020. CharacterBERT: Reconciling ELMo and BERT for word-level openvocabulary representations from characters. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6903–6915, Barcelona, Spain (Online). International Committee on Computational Linguistics. Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran, Tania Bedrax-Weiss, and Dan Roth. 2019. How large are lions? inducing distributions over quantitative attributes. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3973–3983, Florence, Italy. Association for Computational Linguistics. 116 William Falcon et al. 2019. Pytorch lightning. GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning, 3. Lisa Feigenson, Stanislas Dehaene, and Elizabeth Spelke. 2004. Core systems of number. Trends in cognitive sciences, 8(7):307–314. William Fleshman and Benjamin Van Durme. 2023. Toucan: Token-aware character level language modeling. Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions and objects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 266–276, Vancouver, Canada. Association for Computational Linguistics. Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics. Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Nathan Godey, Roman Castagné, Éric de la Clergerie, and Benoît Sagot. 2022. MANTa: Efficient gradient-based tokenization for end-to-end robust language modeling. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2859–2870, Abu Dhabi, United Arab Emirates. 
Association for Computational Linguistics. Pranav Goel, Shi Feng, and Jordan Boyd-Graber. 2019. How pre-trained word representations capture commonsense physical comparisons. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 130–135, Hong Kong, China. Association for Computational Linguistics. Thamme Gowda and Jonathan May. 2020. Finding the optimal vocabulary size for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3955–3964, Online. Association for Computational Linguistics. Thamme Gowda, Zhao Zhang, Chris A Mattmann, and Jonathan May. 2021a. Many-to-english machine translation tools, data, and pretrained models. Thamme Gowda, Zhao Zhang, Chris A. Mattmann, and Jonathan May. 2021b. Many-to-english machine translation tools, data, and pretrained models. CoRR, abs/2104.00290. David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34. Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, 117 Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. 2024. Olmo: Accelerating the science of language models. David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA). Anne Lundgaard Hansen and Sophia Kazinnik. 2023. Can chatgpt decipher fedspeak. Available at SSRN. Jeff Hawkins. 2021. A thousand brains: A new theory of intelligence. Basic Books. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Dan Hendrycks and Kevin Gimpel. 2023. Gaussian error linear units (gelus). Yuncheng Huang, Qianyu He, Jiaqing Liang, Sihang Jiang, Yanghua Xiao, and Yunwen Chen. 2023. Enhancing quantitative reasoning skills of large language models through dimension perception. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao, Libin Shen, and Kewei Tu. 2020. Learning numeral embedding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2586–2599, Online. Association for Computational Linguistics. Devin Johnson, Denise Mak, Andrew Barker, and Lexi Loessberg-Zahl. 2020. Probing for multilingual numerical understanding in transformer-based language models. 
In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 184–192, Online. Association for Computational Linguistics. Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, and Peter Clark. 2021a. How much coffee was consumed during EMNLP 2019? fermi problems: A new reasoning challenge for AI. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7318–7328, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, and Peter Clark. 2021b. How much coffee was consumed during emnlp 2019? fermi problems: A new reasoning challenge for ai. arXiv preprint arXiv:2110.14207. 118 Ayush Kaushal and Kyle Mahowald. 2022. What do tokens know about their characters and how do they know it? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2487–2507, Seattle, United States. Association for Computational Linguistics. Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization. Hyungjin Ko and Jaewook Lee. 2023. Can chatgpt improve investment decision? from a portfolio management perspective. SSRN Electronic Journal. Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. 2022. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of machine translation summit x: papers, pages 79–86. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007a. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007b. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics. Teuvo Kohonen. 1990. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480. Taku Kudo. 2018a. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics. Taku Kudo. 2018b. 
Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics. Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 119 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics. Dipesh Kumar and Avijit Thawani. 2022. BPE beyond word boundary: How NOT to use multi word expressions in neural machine translation. In Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 172–179, Dublin, Ireland. Association for Computational Linguistics. Anoop Kunchukuttan. 2020. The IndicNLP Library. https://github.com/anoopkunchukuttan/ indic_nlp_library/blob/master/docs/indicnlp.pdf. Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay EnglishHindi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA). Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics. Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, and Yoav Shoham. 2021. {PMI}-masking: Principled masking of correlated spans. In International Conference on Learning Representations. Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren. 2020. Numersense: Probing numerical commonsense knowledge of pre-trained language models. Alejandro Lopez-Lira. 2023. Risk factors that matter: Textual analysis of risk disclosures for the cross-section of returns. Jacobs Levy Equity Management Center for Quantitative Financial Research Paper. Sabrina J Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y Lee, Benoît Sagot, et al. 2021a. Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp. arXiv preprint arXiv:2112.10508. Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, and Samson Tan. 2021b. Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Marvin Minsky. 1988. Society of mind. Simon and Schuster. Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, and Chitta Baral. 2020. Towards question format independent numerical reasoning: A set of prerequisite tasks. 120 Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, and Chenlei Guo. 2022. A vocabulary-free multilingual neural tokenizer for end-to-end task learning. In Proceedings of the 7th Workshop on Representation Learning for NLP, pages 91–99, Dublin, Ireland. Association for Computational Linguistics. 
Aakanksha Naik, Abhilasha Ravichander, Carolyn Rose, and Eduard Hovy. 2019. Exploring numeracy in word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3374–3380, Florence, Italy. Association for Computational Linguistics. Mikhail Nefedov. 2020. Dataset for evaluation of mathematical reasoning abilities in russian. In Conference on Artificial Intelligence and Natural Language, pages 135–144. Springer. Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, Xinyi Wang, and John Wieting. 2019. compare-mt: A tool for holistic comparison of language generation systems. CoRR, abs/1903.07926. NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. Rodrigo Nogueira, Zhiying Jiang, and Jimmy Li. 2021. Investigating the limitations of the transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019. OpenAI. 2023. Gpt-4 technical report. Chan Young Park and Yulia Tsvetkov. 2019. Learning to generate word- and phrase-embeddings for efficient phrase-based neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 241–248, Hong Kong. Association for Computational Linguistics. Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are nlp models really able to solve simple math word problems? Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics. 121 Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018b. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237. Rene Pickhardt, Thomas Gottron, Martin Körner, Paul Georg Wagner, Till Speicher, and Steffen Staab. 2014. A generalized language model as the combination of skipped n-grams and modified Kneser Ney smoothing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1145–1154, Baltimore, Maryland. Association for Computational Linguistics. Yuval Pinter, Amanda Stent, Mark Dredze, and Jacob Eisenstein. 2021. Learning to look inside: Augmenting token-based encoders with character-level information. Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. 
In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics. Matt Post. 2018a. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics. Matt Post. 2018b. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics. Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859. Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Online. Association for Computational Linguistics. Michal Ptaszynski, Fumito Masui, Rafal Rzepka, and Kenji Araki. 2014. First glance on patternbased language modeling. Language Acquisition and Understanding Research Group Technical Reports. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, page arXiv:1606.05250. 122 Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484, Hong Kong, China. Association for Computational Linguistics. Samuel Reese, Gemma Boleda, Montse Cuadros, Lluís Padró, and German Rigau. 2010. Wikicorpus: A word-sense disambiguated multilingual Wikipedia corpus. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European Language Resources Association (ELRA). Mat¯ıss Rikters and Ondřej Bojar. 2017. Paying attention to multi-word expressions in neural machine translation. In Proceedings of Machine Translation Summit XVI: Research Track, pages 86–95, Nagoya Japan. Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal. Association for Computational Linguistics. Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott. 2022. Language modelling with pixels. Elizabeth Salesky, David Etter, and Matt Post. 2021. Robust open-vocabulary translation from visual text representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7235–7252, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. David Samuel and Lilja Øvrelid. 2023. Tokenization with factorized subword encoding. 
In Findings of the Association for Computational Linguistics: ACL 2023, pages 14143–14161, Toronto, Canada. Association for Computational Linguistics. David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations. Mike Schuster and Kaisuke Nakajima. 2012. Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural machine translation of rare words with subword units. 123 Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, and Maosong Sun. 2021. Shuowen-jiezi: Linguistically informed tokenizers for chinese language model pretraining. arXiv preprint arXiv:2106.00400. Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, and Maosong Sun. 2023. Sub-Character Tokenization for Chinese Pretrained Language Models. Transactions of the Association for Computational Linguistics, 11:469–487. Georgios P. Spithourakis and Sebastian Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to predict numbers. CoRR, abs/1805.08154. Daniel Spokoyny, Ivan Lee, Zhao Jin, and Taylor Berg-Kirkpatrick. 2022a. Masked measurement prediction: Learning to jointly predict quantities and units from textual context. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 17–29, Seattle, United States. Association for Computational Linguistics. Daniel Spokoyny, Chien-Sheng Wu, and Caiming Xiong. 2022b. Numerical correlation in text. In Proceedings of the 1st Workshop on Mathematical Natural Language Processing (MathNLP), pages 33–39, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu-Jie Cheng, and Junjie Hu. 2022. Local byte fusion for neural machine translation. ArXiv, abs/2205.11490. Dhanasekar Sundararaman, Shijing Si, Vivek Subramanian, Guoyin Wang, Devamanyu Hazarika, and Lawrence Carin. 2020. Methods for numeracy-preserving word embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4742–4753, Online. Association for Computational Linguistics. Richard Sutton. 2019. The bitter lesson. Incomplete Ideas (blog), 13(1):38. John Sylak-Glassman. 2016. The composition and use of the universal morphological feature schema (unimorph schema). Yi Tay, Vinh Q Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, and Donald Metzler. 2021. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations. NLLB Team, Marta R. 
Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, and Jay Pujara. 2023a. Learn your tokens: Wordpooled tokenization for language modeling. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9883–9893. 124 Avijit Thawani, Jay Pujara, and Filip Ilievski. 2021a. Numeracy enhances the literacy of language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6960–6967, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Avijit Thawani, Jay Pujara, Filip Ilievski, and Pedro Szekely. 2021b. Representing numbers in NLP: a survey and a vision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–656, Online. Association for Computational Linguistics. Avijit Thawani, Jay Pujara, Filip Ilievski, and Pedro Szekely. 2021c. Representing numbers in NLP: a survey and a vision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–656, Online. Association for Computational Linguistics. Avijit Thawani, Jay Pujara, and Ashwin Kalyan. 2023b. Estimating numbers without regression. arXiv preprint arXiv:2310.06204. Jörg Tiedemann. 2009. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces, volume V, pages 237–248. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models. 
Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. 2018. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, pages 8035–8044. Janneke van der Zwaan, Dafne van Kuppevelt, Maksim Abdul Latif, Melle Lyklema, and Christian Lange. 2019. arabic-digital-humanities/root-extraction- validation-data: 0.1.0. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30. 125 Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics. Xing Wang, Zhaopeng Tu, Deyi Xiong, and Min Zhang. 2017. Translating phrases in neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1421–1431, Copenhagen, Denmark. Association for Computational Linguistics. Xinyi Wang, Sebastian Ruder, and Graham Neubig. 2021. Multi-view subword regularization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 473–482, Online. Association for Computational Linguistics. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. Benoist Wolleb, Romain Silvestri, Giorgos Vernikos, Ljiljana Dolamic, and Andrei Popescu-Belis. 2023. Assessing the importance of frequency versus compositionality for subword-based tokenization in nmt. Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016a. Google’s neural machine translation system: Bridging the gap between human and machine translation. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016b. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Qianqian Xie, Weiguang Han, Yanzhao Lai, Min Peng, and Jimin Huang. 2023. 
The wall street neophyte: A zero-shot analysis of chatgpt over multimodal stock movement prediction challenges. arXiv preprint arXiv:2304.07359. 126 Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306. Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis. 2023. Megabyte: Predicting million-byte sequences with multiscale transformers. arXiv preprint arXiv:2305.07185. Andrea Zaninello and Alexandra Birch. 2020. Multiword expression aware neural machine translation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3816–3825, Marseille, France. European Language Resources Association. Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4889–4896, Online. Association for Computational Linguistics. Ben Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. 2020. Temporal common sense acquisition with minimal supervision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7579–7589, Online. Association for Computational Linguistics. Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3530–3534, Portorož, Slovenia. European Language Resources Association (ELRA). 127
Abstract
Natural language is a sequence of symbols. Language models (LMs) are powerful at learning sequence patterns. The first step for large language models (LLMs) like ChatGPT is to convert text (which humans understand) into symbolic codes (which models do). This crucial phase of the language modeling pipeline has unfortunately been understudied and is currently handled by subword segmentation, a manually engineered set of heuristics. I scrutinize case studies where these heuristics fail, for example when representing numbers, multi-word expressions, and non-concatenative languages, and recommend improvements for each. I present an end-to-end tokenized language model that understands both words and numbers better than subword models, without any manually engineered heuristics. This model also outperforms character-level tokenization, promising up to a 4x speedup in inference and a 6x speedup in training.
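As a concrete illustration of the failure cases mentioned above (my own example, not drawn from the thesis), the short Python snippet below uses the Hugging Face transformers library to show how a standard BPE subword vocabulary, here GPT-2's, fragments numbers and multi-word expressions into arbitrary pieces; the model name and example strings are illustrative choices.

    # Illustrative sketch: how a standard BPE vocabulary splits numbers and idioms.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    for text in ["The rent is 2473 dollars", "kick the bucket"]:
        print(text, "->", tokenizer.tokenize(text))

    # Typical behaviour (exact pieces depend on the learned vocabulary):
    # "2473" is split into multiple digit chunks, so the model never sees the
    # quantity as a single symbol, and "kick the bucket" becomes three unrelated
    # tokens with no shared symbol for the idiom as a whole.

Nothing in this snippet is specific to the thesis; it only demonstrates the kind of manually engineered segmentation the dissertation argues should be replaced by learned aggregation.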
I show the benefits of aggregating symbols for language modeling, and investigate key aspects of symbol use in LMs:
- Aggregating on the number line improves both the numeracy and the literacy of language models (see the sketch after this list)
- We can learn how to aggregate symbols directly from a corpus, yielding improved language modeling and approximate numeracy
- Learning to aggregate symbols helps downstream performance in certain application areas like neural machine translation of non-concatenative languages
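To make the first bullet concrete, here is a minimal, hypothetical sketch of one way numbers could be aggregated on the number line before language modeling. The regular expression, order-of-magnitude binning, and <NUM_1e·> token names are my own illustrative assumptions, not necessarily the scheme used in the thesis.

    # Hypothetical sketch: collapse each literal number onto a coarse point on the
    # number line (its order of magnitude), so nearby quantities share one symbol.
    import math
    import re

    def aggregate_numbers(text: str) -> str:
        def to_bin(match: re.Match) -> str:
            value = float(match.group())
            exponent = 0 if value == 0 else int(math.floor(math.log10(abs(value))))
            return f"<NUM_1e{exponent}>"  # e.g. 2473 -> <NUM_1e3>
        return re.sub(r"\d+(?:\.\d+)?", to_bin, text)

    print(aggregate_numbers("The rent rose from 2473 to 2512 dollars"))
    # -> "The rent rose from <NUM_1e3> to <NUM_1e3> dollars"

The intended benefit of such aggregation is that rarely seen numbers inherit statistics from their neighbors on the number line, rather than being split into arbitrary subword pieces.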
Conceptually similar
Expanding the performance-compute frontier for retrieval-augmented language models
Syntax-aware natural language processing techniques and their applications
Towards more human-like cross-lingual transfer learning
Building generalizable language models for code processing
Lexical complexity-driven representation learning
Identifying and mitigating safety risks in language models
Annotating FrameNet via structure-conditioned language generation
The inevitable problem of rare phenomena learning in machine translation
Computational models for multidimensional annotations of affect
Common ground reasoning for communicative agents
Emphasizing the importance of data and evaluation in the era of large language models
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Robust and proactive error detection and correction in tables
Towards generalized event understanding in text via generative models
Balancing prediction and explanation in the study of language usage and speaker attributes
Learning at the local level
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
Speech recognition error modeling for robust speech processing and natural language understanding applications
Event-centric reasoning with neuro-symbolic networks and knowledge incorporation
Bridging the visual reasoning gaps in multi-modal models
Asset Metadata
Creator
Thawani, Avijit (author)
Core Title
Aggregating symbols for language models
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-08
Publication Date
08/13/2024
Defense Date
08/12/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
attention,BPE,efficient NLP,language modeling,large language models,machine translation,multilingual,natural language processing,NLP,NLU,numeracy,subword vocabulary,tokenization,Transformer
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Pujara, Jay (committee chair), Hoberg, Gerard (committee member), Nakano, Aiichiro (committee member), Swayamdipta, Swabha (committee member), Yogatama, Dani (committee member)
Creator Email
avijit.thawani@gmail.com,thawani@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113998TFR
Unique identifier
UC113998TFR
Identifier
etd-ThawaniAvi-13384.pdf (filename)
Legacy Identifier
etd-ThawaniAvi-13384
Document Type
Dissertation
Rights
Thawani, Avijit
Internet Media Type
application/pdf
Type
texts
Source
20240813-usctheses-batch-1197 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu