NEURAL SEQUENCE MODELS: INTERPRETATION AND AUGMENTATION

by

Xing Shi

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2018

Copyright 2018 Xing Shi

Dedication

To myself, who struggled, switched, settled, suspected, sustained and survived, within 6 years.

Acknowledgement

It is a great fortune to work with my Ph.D. advisor, Kevin Knight, who shaped me in the past 5 years. I learned 3 important skills from him: 1) Choose the right research direction from either a real practical application or your inner curiosity; 2) Shape, develop and communicate research ideas using a handout; 3) When facing failure, find the bug, communicate with others, or switch fast. Kevin provides opportunity: he sent me to the deep learning for machine translation winter school. He encourages me to explore: I stepped on a short start-up journey with my first research paper. He spends enormous time and effort on his students: he told me on the first day we met, "You'll need six years, you'd better not quit. I spend lots of time on students." He is a great coach: he provides detailed comments and instructions, and sets high expectations internally to stimulate me to move forward. No words can express my thankfulness for his endless support and mentoring.

I would also like to acknowledge my other committee members: Jonathan May, Shri Narayanan, Andrew Gordon and Kallirroi Georgila for their insightful comments and feedback. My PhD journey wouldn't have been successful without collaborations and discussion with many people including but not limited to: Nima Pourdamghani, Aliya Deri, Ashish Vaswani, Tomer Levinboim, Ulf Hermjakob, Jonathan May, Daniel Marcu, Kenji Sagae, Inkit Padhi, Yejin Choi, Minlie Huang, Jay Priyadarshi, Heng Ji, Deniz Yuret, Linhong Zhu and Yi Wei. I want to especially thank my labmate, Marjan Ghazvininejad, for the great collaboration experience and the joy of winning multiple awards together, as well as my friend of 10 years, landlord, gym partner and cooking partner, Kuan Liu, for the wonderful collaboration experience and all the joy of PhD life. I would also like to express my gratitude to the USC/ISI staff for their tremendous support: Peter Zamar, Kary Lau, Alma Nava and Lizsl De Leon.

I wish to deeply thank my parents Haiyong Shi and Qiuhong Xie. Without their endless love and unconditional support, I would not have made such great achievements. Finally, my special thanks to my beloved wife, Na Ding, who understood, supported, and encouraged me to get through this adventure and made this journey colorful and memorable.

Contents

Dedication
Acknowledgement
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Neural Sequence Models
    1.1.1 Sequence to Sequence Model
    1.1.2 Training Algorithm: Backward Propagation Through Time
    1.1.3 Decoding Algorithm: Beam Search
  1.2 Open up the Black Box
  1.3 Battle against the Huge Vocabulary Set
  1.4 Guide the Generation by External Knowledge
  1.5 Quench the Data Thirst
  1.6 Contributions
2 Interpretation: Length Control Mechanism in Sequence-to-sequence Model
  2.1 Introduction
  2.2 A Toy Problem for Neural MT
  2.3 Full-Scale Neural Machine Translation
  2.4 Mechanisms for Decoding
  2.5 Conclusion

3 Interpretation: Encoder Has Grasped Syntactic Information
  3.1 Introduction
  3.2 Example
  3.3 Related work
  3.4 Datasets and models
  3.5 Syntactic Label Prediction
    3.5.1 Experimental Setup
    3.5.2 Result Analysis
  3.6 Extract Syntactic Trees from Encoder
    3.6.1 Experimental Setup
    3.6.2 Evaluation
  3.7 Conclusion

4 Augmentation: Speed Up Decoding using Word Alignment
  4.1 Introduction
  4.2 Methods
    4.2.1 Word Frequency
    4.2.2 Locality Sensitive Hashing
    4.2.3 Word Alignment
  4.3 Experiments
  4.4 Conclusion

5 Augmentation: Speed Up Decoding using Locality Sensitive Hashing
  5.1 Introduction
  5.2 Related Work
  5.3 GPU Computing Concept
    5.3.1 Warp
    5.3.2 Memory Hierarchy
    5.3.3 Latency Hiding
  5.4 Locality Sensitive Hashing
    5.4.1 LSH on CPU
    5.4.2 LSH on GPU
  5.5 Experiment
  5.6 Conclusion

6 Augmentation: Constrain the Beam Search with Finite State Acceptor
  6.1 Introduction
  6.2 Related Work
  6.3 Selecting Topically Related Rhyme Pairs
    6.3.1 Vocabulary
    6.3.2 Related Words and Phrases
    6.3.3 Choosing Rhyme Words
  6.4 Constructing FSA of Possible Poems
  6.5 Path extraction through FSA with RNN
  6.6 Style Control
  6.7 Speedup
  6.8 Experiments
    6.8.1 Quality analysis
    6.8.2 Human-Computer Collaboration
  6.9 Conclusion

7 Augmentation: Neural Machine Translation for Low-resource Languages
  7.1 Introduction
  7.2 Sub-word Translation
  7.3 Regularization for Training
    7.3.1 Dropout on RNN
    7.3.2 Layer Normalization
  7.4 Rare Word Selection
  7.5 Inadequate Translation
  7.6 Experiment
    7.6.1 Data set
    7.6.2 The Baseline Systems
    7.6.3 Evaluation
    7.6.4 Results and Analysis
  7.7 Conclusion

8 Conclusion

Reference List

List of Tables

1.1 The event of interest E and the corresponding conditional event F of different NLP tasks.
2.1 R² values showing how differently-chosen sets of 10 LSTM hidden units correlate with length in the NMT encoder.
2.2 Sets of k units chosen by beam search to optimally track length in the NMT encoder. These units are from the LSTM's second layer.
3.1 Voice (active/passive) prediction accuracy using the encoding vector of an NMT system. The majority class baseline always chooses active.
3.2
3.3 Labeled F1-scores of different parsers on WSJ Section 23. The F1-score is calculated on valid trees only.
3.4 Corpus statistics for five syntactic labels.
3.5 Perplexity, labeled F1-score, and Tree Edit Distance (TED) of various systems. Labeled F1-scores are calculated on EVALB-trees only. Tree edit distances are calculated on the well-formed trees only. EVALB-trees are those whose number of leaves match the number of words in the source sentence, and are otherwise accepted by standard Treebank evaluation software.
3.6 Labeled F1-scores and POS tagging accuracy on the intersection set of EVALB-trees of different parsers. There are 569 trees in the intersection, and the average length of the corresponding English sentences is 12.54.
4.1 Time breakdown and BLEU score of full vocabulary decoding and our "WA50" decoding, both with beam size 12. WA50 means decoding informed by word alignments, where each source word can select at most 50 relevant target words.
The model is a 2-layer, 1000-hidden dimension, 50,000-target vocabulary LSTM seq2seq model with local attention trained on the ASPEC Japanese-to-English corpus (Nakazawa et al., 2016). The time is measured on a single Nvidia Tesla K20 GPU.
4.2 Word type coverage (TC), BLEU score, and speedups (X) for full-vocabulary decoding (Full), top frequency vocabulary decoding (TF*), LSH decoding (LSH*), and decoding with word alignments (WA*). TF10K represents decoding with the top 10,000 frequent target vocabulary (C = 10,000). WA10 means decoding with word alignments, where each source word can select at most 10 candidate target words (M = 10). For LSH decoding, we choose (32, 5000, 1000) for (K, P, W), and vary C.
4.3 Training configurations on different language pairs.
5.1 Time breakdown, runtime vocabulary size, and BLEU score of full vocabulary decoding and LSH decoding. The model is a 2-layer, 1000-hidden dimension, 40,000 target vocabulary LSTM seq2seq model trained on a French to English corpus (Bojar et al., 2014). The experiments are conducted on a Nvidia K20 GPU and a single-core 2.4GHz Intel Xeon CPU. The code is compiled against CUDA 8.0.
5.2 Comparison of speedup methods.
5.3 The time consumption and floating point operations per second (Gflop/s) of matrix multiplication on GPU at different scales.
5.4 The runtime vocabulary size and time breakdown of each step of full vocabulary decoding and our LSH decoding on translating a French sentence to an English sentence with beam size 12. The last column means the slowdown if the corresponding optimized step is replaced by a naive implementation.
5.5 The running example of WTA hash with W = 2, u = 2 and K = 2.
5.6 Training configurations of different language pairs. The attention model is based on Luong et al. (2015). Data sources: ASPEC Japanese-English Corpus (Nakazawa et al., 2016), French-English Corpus from WMT2014 (Bojar et al., 2014), and Uzbek-English Corpus (Consortium, 2016).
5.7 The speedup and BLEU loss of LSH decoding over full-vocabulary decoding at different beam sizes on French-to-English translation.
7.1 The BLEU score of neural MT and syntax-based MT systems from Uyghur to English and German to English (Luong et al., 2015).
7.2 Statistics of the Uyghur-English corpus.
7.3 BLEU scores on the test set of different Uyghur-English translation systems.

List of Figures

1.1 The structure of the sequence to sequence model. The input sequence is [f_1, f_2, f_3] and the output sequence is [e_1, e_2, e_3]. The white and grey rectangles represent LSTM cells with two different sets of parameters.
2.1 The encoder-decoder framework for neural machine translation (NMT) (Sutskever et al., 2014b). Here, a source sentence C B A (fed in reverse as A B C) is translated into a target sentence W X Y Z. At each step, an evolving real-valued vector summarizes the state of the encoder (left half) and decoder (right half).
2.2 After learning, the recurrent network can convert any string of a's and b's into a 4-dimensional vector.
The left plot shows the encoded strings in dimensions described by the cell states of LSTM unit 1 (x-axis) and unit 2 (y-axis). unit 1 learns to record the length of the string, while unit 2 records whether there are more b's than a's, with a +1 bonus for strings that end in a. The right plot shows the cell states of LSTM unit 3 (x-axis) and unit 4 (y-axis). unit 3 records how many a's the string begins with, while unit 4 correlates with both length and the preponderance of b's. Some text labels are omitted for clarity.
2.3 The progression of LSTM state as the recurrent network encodes the string "a b a b b b". Columns show the inputs over time and rows show the outputs. Red color indicates positive values, and blue color indicates negative. The value of unit 1 decreases during the encoding phase (top figure) and increases during the decoding phase (middle figure). The bottom figure shows the decoder's probability of ending the target string (<EOS>).
2.4 Activation of translation unit 109 and unit 334 during the encoding and decoding of a sample sentence. Also shown is the softmax log-prob of output <EOS>.
3.1 Sample inputs and outputs of the E2E, PE2PE, E2F, E2G, and E2P models.
3.2 The five syntactic labels for the sentence "This time , the firms were ready".
3.3 Prediction accuracy of five syntactic labels on test. Each syntactic label is predicted using both the lower-layer cell states (C0) and higher-layer cell states (C1). For each cell state, we predict each syntactic label using all 1000 units (All), as well as the top 10 units (Top10) selected by recursive feature elimination. The horizontal blue line is the majority class accuracy.
3.4 E2F and E2F2P share the same English encoder. When training E2F2P, we only update the parameters of the linearized tree decoder, keeping the English encoder's parameters fixed.
3.5 For model E2P (the red bar), we show the average number of bracket errors per sentence due to the top 11 error types. For other models, we show the ratio of each model's average number of bracket errors to that of model E2P. Errors are analyzed on the intersection set. The table is sorted based on the ratios of the E2F2P model.
3.6 Example of Sense Confusion. The POS tag for the word "beyond" is predicted as "RB" instead of "IN", resulting in a missing prepositional phrase.
5.1 Illustration of kernel, warp and warp divergence. The solid line means the thread is active, and the dashed line means the thread is idle. Because of the branch, the first half of the warp will execute instruction A and be idle when the other half executes instruction B.
5.2 Comparison of the pipeline of full vocabulary softmax, LSH softmax on CPU proposed in Vijayanarasimhan et al. (2015), and our LSH softmax on GPU. Every step of full vocabulary softmax and our LSH softmax on GPU is executed in batch mode, whereas the steps inside the grey box of LSH softmax on CPU are executed separately for each hidden vector in the beam.
5.3 Example of cuckoo lookup. The beam size is 1, W = 2 and |V| = 6.
5.4 Illustration of naive implementation and optimized implementation of lines 3-6 in Algorithm 2. We assume each warp contains 4 threads, and their for-loop lengths are 1, 7, 3, and 2. The round grey rectangle represents one step of a warp. (a) The naive implementation, which takes the warp 7 steps to finish. (b) The optimized implementation, which takes only 5 steps.
5.5 Illustration of the optimized stream compaction algorithm. We assume each warp contains 4 threads here. 2 warps will first load L into shared memory in a coalesced read. Then only the first thread of each warp will scan the 4 values and filter out the valid word IDs. Then each warp will write the valid word IDs back in a coalesced write. The start position in V_LSH for each warp is maintained in global memory, omitted here.
5.6 BLEU/speedup curve of 3 decoding methods on 4 translation directions.
5.7 The BLEU/speedup curve for different T on the French to English translation model.
6.1 Overview of Hafez converting a user-supplied topic word (wedding) into a four-line iambic pentameter stanza.
6.2 An FSA compactly encoding all word sequences that obey formal sonnet constraints, and dictating the right-hand edge of the poem via rhyming, topical words delight, chance, ... and joy.
6.3 Sample sonnet generated from the topic phrase bipolar disorder.
6.4 Sample stanzas generated from different topic phrases.
7.1 The ratio of non-UNK tokens of the Uyghur/English train/dev sets when selecting different numbers of top frequent words as vocabulary.
7.2 Depiction of three different dropout techniques on RNN. Each block represents an RNN unit. Vertical arrows represent feed-forward connections between layers, and horizontal arrows represent recurrent connections between steps. A dashed arrow means a dropout mask is applied on the connection, and a solid arrow means there is no dropout. Different colors mean different sampled dropout masks.

Abstract

Recurrent neural networks (RNN) have been successfully applied to various Natural Language Processing tasks, including language modeling, machine translation, text generation, etc. However, several obstacles still stand in the way: First, due to the RNN's distributional nature, few interpretations of its internal mechanism are obtained, and it remains a black box. Second, because of the large vocabulary sets involved, text generation is very time-consuming. Third, there is no flexible way to constrain the generation of the sequence model with external knowledge. Last, huge training data must be collected to guarantee the performance of these neural models, whereas annotated data such as the parallel data used in machine translation are expensive to obtain.

This work aims to address the four challenges mentioned above. To further understand the internal mechanism of the RNN, we choose neural machine translation (NMT) systems as a testbed. We first investigate how NMT outputs target strings of appropriate lengths, locating a collection of hidden units that learns to explicitly implement this functionality. Then we investigate whether NMT systems learn source language syntax as a by-product of training on string pairs.
We find that both local and global syntactic information about source sentences is captured by the encoder. Different types of syntax are stored in different layers, with different degrees of concentration.

To speed up text generation, we propose two novel GPU-based algorithms: 1) utilize the source/target word alignment information to shrink the target-side run-time vocabulary; 2) apply locality sensitive hashing to find nearest word embeddings. Both methods lead to a 2-3x speedup on four translation tasks without hurting machine translation accuracy as measured by BLEU. Furthermore, we integrate a finite state acceptor into the neural sequence model during generation, providing a flexible way to constrain the output, and we successfully apply this to poem generation, in order to control the meter and rhyme.

To improve NMT performance on low-resource language pairs, we re-examine multiple technologies that are used in high-resource language NMT and other NLP tasks, explore their variations, and arrive at a strong NMT system for low-resource languages. Experiments on Uyghur-English show a 10.4 BLEU score improvement over the vanilla NMT system, achieving results comparable to syntax-based machine translation.

Chapter 1
Introduction

1.1 Neural Sequence Models

The neural sequence model has achieved great success on various NLP tasks, including language modeling, machine translation, text generation, etc. However, it is still far from perfect: First, given a trained model, few interpretations of its internal mechanism can be obtained, due to the distributional representations used inside; thus it still remains a black box. Second, during the text generation stage, the search algorithm can be very time-consuming because of the large vocabulary, and we lack a flexible way to constrain the generation with certain external knowledge. Last, to estimate the parameters of the sequence model, huge annotated data need to be collected to guarantee the performance, whereas these data (such as the parallel sentence pairs used in machine translation) are expensive to obtain. The research goal of this dissertation is to address the above-mentioned challenges during the parameter estimation stage and the text generation stage, and to provide a better interpretation to enhance the understanding of neural sequence models.

While human languages can convey complex semantic meanings using various syntactic structures, they are essentially always in the form of sequences of words. Thus, modeling sequences becomes one of the core missions of Natural Language Processing (NLP). Mathematically, given the input F and output E, the goal of a sequence model is to calculate the following conditional distribution:

P(E|F)    (1.1)

where either E or F is a sequence of symbols, or both of them are sequences. A large number of NLP tasks can be formulated by Equation 1.1 given different choices of E or F, as shown in Table 1.1.

  NLP Task                  E                      F
  Sentence Classification   Discrete classes       Sequence of words
  Part-Of-Speech Tagging    Sequence of POS tags   Sequence of words
  Machine Translation       Sequence of words      Sequence of words
  Topical Poem Generation   Sequence of words      Sequence of topic words
  Image Caption             Sequence of words      Image
  Language Model            Sequence of words      Null
Table 1.1: The event of interest E and the corresponding conditional event F of different NLP tasks.
Over the years, several types of sequence models have been proposed and achieved great success: 1) The generative models, represented by Hidden Markov Models (HMM) (Baum and Petrie, 1966), model the joint probability of the two sequences P(E, F); 2) The discriminative models, represented by the linear-chain Conditional Random Field (CRF) (Lafferty et al., 2001), model the conditional probability P(E|F) directly; 3) The finite-state machines (Turing, 1937), including both the Weighted Finite-State Acceptor (WFSA) and the Weighted Finite-State Transducer (WFST), are computation devices that can either accept a sequence or convert a sequence to another sequence; 4) Some ad-hoc models are designed specifically for certain tasks, like the Phrase-based model (Koehn et al., 2003) for Statistical Machine Translation (SMT).

However, these models still suffer from three major issues: First, the representational power of the features used in these models is usually limited. The features are usually categorical features which are represented by sparse one-hot vectors. The distance between these features is always one, and thus cannot reflect their internal relationship. Second, to grasp the relationship between two features that are far away in the sequence, the complexity of these models will increase exponentially. Third, some complex systems, such as Phrase-based Machine Translation, are hybrids of several components (phrase extraction and language model), so those decoupled components cannot be trained in an end-to-end fashion.

The neural sequence models successfully solve the above-mentioned three issues. We will briefly review the structure of the neural sequence models and related algorithms for readers to better understand the rest of this dissertation.

1.1.1 Sequence to Sequence Model

The most generic form of such neural sequence models is the sequence to sequence (seq2seq) model (Sutskever et al., 2014a), whose structure is shown in Figure 1.1. It utilizes the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997b) structure as the building block. It consists of two LSTMs: the encoder, which reads the source sequence F = f_1^M one token at a time to obtain a fixed-dimension vector representation, i.e. the intermediate vector, and the decoder, which generates the target sequence E = e_1^N one token at a time out of the intermediate vector. Formally, it decomposes the sequence conditional distribution as follows:

P(E|F) = \prod_{i=1}^{N} p(e_i | e_1^{i-1}, f_1^M)    (1.2)

Figure 1.1: The structure of the sequence to sequence model. The input sequence is [f_1, f_2, f_3] and the output sequence is [e_1, e_2, e_3]. The white and grey rectangles represent LSTM cells with two different sets of parameters.

One can easily modify the seq2seq model for the Image Caption task by using a Convolutional Neural Network (CNN) as the encoder (Xu et al., 2015), or for the Language Model task by removing the encoder and using zero vectors as the intermediate vector (Mikolov et al., 2010).

In more detail, at the bottom of the encoder, there is an embedding lookup layer. Its only parameter is a matrix W_F \in R^{|V_F| \times d}, where V_F is the source vocabulary and d is the embedding size. The embedding lookup layer converts the word symbol f_i into a d-dimensional vector w_{f_i} \in R^d. Usually, f_i is represented as an integer and w_{f_i} is the f_i-th row of the matrix W_F.
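To make the lookup concrete, here is a minimal Python/numpy sketch of an embedding lookup layer; the vocabulary, embedding size, and randomly initialized W_F are toy placeholders rather than the parameters of any system described in this dissertation:

import numpy as np

source_vocab = {"<s>": 0, "a": 1, "b": 2, "</s>": 3}    # toy V_F, purely illustrative
d = 4                                                    # embedding size
rng = np.random.default_rng(0)
W_F = rng.normal(size=(len(source_vocab), d))            # W_F in R^{|V_F| x d}

def embed(tokens):
    # convert word symbols into integer ids f_i, then take the f_i-th rows of W_F
    ids = [source_vocab[t] for t in tokens]
    return W_F[ids]                                       # shape (M, d)

print(embed(["a", "b", "b", "</s>"]).shape)               # -> (4, 4)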
After converting the words into embeddings, the encoder uses an LSTM cell to calculate the hidden state h^F_i \in R^d at each step i recursively:

h^F_i, c^F_i = LSTM(h^F_{i-1}, c^F_{i-1}, w_{f_i})    (1.3)
h^F_0, c^F_0 = 0, 0    (1.4)

Finally, the encoder converts the whole source sequence F = f_1^M into two fixed d-dimensional vectors h^F_M, c^F_M.

At the bottom of the decoder, there is also an embedding lookup layer with its parameter W_E \in R^{|V_E| \times d}. It converts each target word e_i into a d-dimensional embedding w_{e_i} \in R^d. Another LSTM cell calculates the hidden state h^E_i recursively:

h^E_i, c^E_i = LSTM(h^E_{i-1}, c^E_{i-1}, w_{e_{i-1}})    (1.5)
h^E_0, c^E_0 = h^F_M, c^F_M    (1.6)

At each step of the decoder, we need to calculate p(e_i | e_1^{i-1}, f_1^M), the conditional probability of the current target word e_i given all previous target words e_1^{i-1} and all source words f_1^M. We first calculate a score s(k) for each word k \in V_E using h^E_i:

s(k \in V_E) = o_k^T h^E_i + b_k    (1.7)

where o_k \in R^d is the output word embedding for word k and b_k \in R is the corresponding bias. They are both parameters that need to be estimated. Then we normalize these scores using the softmax function to get the conditional probability:

p(e_i | e_1^{i-1}, f_1^M) = \frac{e^{s(e_i)}}{\sum_{k \in V_E} e^{s(k)}}    (1.8)

1.1.2 Training Algorithm: Backward Propagation Through Time

The parameters of a seq2seq model include the parameters of the encoder LSTM and decoder LSTM, the input word embeddings for both source and target vocabularies, and the output word embeddings and biases for the target vocabulary. To estimate the values of these parameters (collectively denoted \theta), we first need a set of parallel sequences D = {(F_k, E_k) | k = 1, ..., n} as our training data. Then the training objective function can be formulated as follows:

J(\theta; D) = \sum_{k=1}^{n} \log p(E_k | F_k)    (1.9)

To optimize the objective function, one can apply the algorithm of backward propagation through time (BPTT) (Mozer, 1989). As different time steps t actually share the same parameters \theta, we can view them as different instances \theta_t. BPTT first calculates the gradients \frac{\partial J}{\partial \theta_t} using the traditional backward propagation algorithm, then sums them up to get the final gradients:

\frac{\partial J}{\partial \theta} = \sum_t \frac{\partial J}{\partial \theta_t}    (1.10)

After we get the gradients, we can simply use the stochastic gradient descent (SGD) algorithm to update our parameters:

\theta = \theta - \eta \frac{\partial J}{\partial \theta}    (1.11)

where \eta is the learning rate, which gradually shrinks during training until convergence. One can also choose other advanced optimization algorithms like Adagrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014).

1.1.3 Decoding Algorithm: Beam Search

During the training stage, we estimate the parameters of the seq2seq model with a parallel training corpus. During the decoding stage, our goal is to find the most likely target sequence \hat{E} given the source sequence F:

\hat{E} = \arg\max_E \log p(E|F)    (1.12)
        = \arg\max_{e_1^N} \sum_{i=1}^{N} \log p(e_i | e_1^{i-1}, f_1^M)    (1.13)

In theory, there are |V|^N possible target sequences. To reduce the time and memory requirements, we apply beam search (shown in Algorithm 1) to approximate the best target sequence. Given the beam size B, the complexity of beam search is O(N B |V|).

In summary, the internal high-dimensional representations inside the LSTMs can be viewed as the features extracted/constructed by the model. Such distributional representations and the multiple non-linear transformations greatly increase the flexibility and power of the neural sequence models, also enabling the models to reveal the relationships between two features that are far apart in the sequence. Thanks to back-propagation through time (BPTT), the whole seq2seq model can be trained in an end-to-end fashion.
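To make Equations 1.7-1.9 concrete, here is a minimal Python/numpy sketch of one decoder step's output layer and one term of the training objective; the hidden state and output embeddings are random placeholders rather than trained parameters, and the gradient computation that BPTT would perform is not shown:

import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 6                  # hidden size and a toy |V_E|
O = rng.normal(size=(vocab_size, d))  # output word embeddings o_k, one row per target word
b = rng.normal(size=vocab_size)       # biases b_k
h = rng.normal(size=d)                # decoder hidden state h^E_i at one step

scores = O @ h + b                    # Equation 1.7: s(k) = o_k^T h^E_i + b_k
probs = np.exp(scores - scores.max())
probs = probs / probs.sum()           # Equation 1.8: softmax over V_E

gold = 3                              # index of the reference word e_i (arbitrary here)
log_likelihood = np.log(probs[gold])  # one term inside the sum of Equation 1.9
print(probs.sum(), log_likelihood)    # probabilities sum to 1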
Algorithm 1 Beam search
Inputs: beam size B, target vocabulary V, log p(e_i | e_1^{i-1}, f_1^M)
Output: \hat{E}
 1: Initialize B tuples (E_b, score_b) with E_b = [] and score_b = 0
 2: for i = 1 to N do
 3:   Heap<entry, value> h = []
 4:   for b = 1 to B do
 5:     select the top B words e_i according to log p(e_i | E_b, f_1^M)
 6:     push B tuples ([E_b, e_i], log p(e_i | E_b, f_1^M) + score_b) into h
 7:   end for
 8:   select the top B tuples from h according to their values: {(entry_i, value_i) | i = 1, ..., B}
 9:   for b = 1 to B do
10:     E_b = entry_b, score_b = value_b
11:   end for
12: end for
13: \hat{E} = E_1
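The following short Python sketch mirrors Algorithm 1. The log-probability callback stands in for the trained model's p(e_i | e_1^{i-1}, f_1^M), and, unlike Algorithm 1, the sketch also stops expanding hypotheses once they emit <EOS>. The toy scorer at the end is purely hypothetical and simply copies the source string:

import heapq

def beam_search(log_prob, source, beam_size, max_len, eos="<EOS>"):
    beams = [(0.0, [])]                                 # (summed log-prob, target prefix)
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            if prefix and prefix[-1] == eos:            # finished hypothesis: keep as-is
                candidates.append((score, prefix))
                continue
            probs = log_prob(prefix, source)            # dict: word -> log p(word | prefix, source)
            for word in heapq.nlargest(beam_size, probs, key=probs.get):
                candidates.append((score + probs[word], prefix + [word]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(p and p[-1] == eos for _, p in beams):
            break
    return beams[0][1]

# Purely hypothetical scorer used only for demonstration: it copies the source string.
def toy_log_prob(prefix, source):
    target = source + ["<EOS>"]
    wanted = target[len(prefix)] if len(prefix) < len(target) else "<EOS>"
    return {w: (0.0 if w == wanted else -10.0) for w in ["a", "b", "<EOS>"]}

print(beam_search(toy_log_prob, ["a", "b", "b"], beam_size=3, max_len=10))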
1.2 Open up the Black Box

Despite the great power of Deep Neural Networks (DNN) and the extraordinary boost they provide on various applications, they have been criticized as a "black box" (Alain and Bengio, 2016) due to the lack of sound theoretical support or a comprehensive understanding of their internal organization. One of the underlying reasons why they are hard to interpret is the high-dimensional vectors DNNs use internally. Unlike in traditional graphical models, these vectorized variables have no predefined semantic meanings; they are assumed to be the features extracted by the model under the model's own space, which is probably not aligned with the human semantic space.

Researchers in the field of computer vision (CV) pioneered the visualization of DNNs, especially the Convolutional Neural Network (CNN). The general idea is to convert the internal representations back into the input space (the raw pixels) via Inversion (Mahendran and Vedaldi, 2015), Deconvolutional Networks (Zeiler and Fergus, 2014) or using the Evolutionary Algorithm (Nguyen et al., 2015). They reveal that the lower layers extract low-level features like simple edge filters, and the neurons in the higher layers tend to represent high-level features like human face contours.

Visualizing the neural networks in NLP, especially the Recurrent Neural Network, introduces further challenges. One essential difference between NLP and CV is that they have different forms of raw inputs: the input for CV is usually raw images consisting of real-valued pixels, whereas the input for NLP is usually sequences of discrete words. Thus it is much more challenging to convert the real-valued hidden vectors back to the discrete input space. On the other hand, the word sequences in NLP also contain rich structural patterns which are critical for both syntactic and semantic representations, and this further increases the difficulty of interpreting RNNs.

Several existing works try to interpret RNNs. Karpathy et al. (2016) use a character-level LSTM language model as a test-bed and find several activation cells that track long-distance relationships, such as line lengths and quotes. They mainly try to interpret the RNN by drawing the activation heat map. They do not touch the seq2seq model nor other complicated patterns residing inside the sequence. Li et al. (2016) explore the syntactic behavior of an RNN-based sentiment analyzer, including the compositionality of negation, intensification, and concessive clauses, by plotting a 60-dimensional heat map of hidden unit values. However, this method potentially restricts them to small-scale models. They also introduce a first-order derivative based method to measure each word's contribution to the final decision. Several other works try to build a good distributional representation of sentences or paragraphs (Socher et al., 2013; Kalchbrenner et al., 2014; Kim, 2014; Zhao et al., 2015; Le and Mikolov, 2014; Kiros et al., 2015). They implicitly verify the claimed syntactic/semantic properties of the learned representations by applying them to downstream classification tasks such as sentiment analysis, sentence classification, semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification, etc. The verification is indirect and doesn't reflect the internal organization of those hidden states.

Compared to the above works, the novelty of the interpretation in this dissertation includes: 1) We interpret the more complicated seq2seq model. 2) We investigate how sophisticated patterns, such as syntactic properties of the sentence, are embedded in the hidden states. 3) Our proposed interpretation methods are suitable for large-scale models.

In this dissertation, I will gradually open up the black box by investigating various properties of the sequence to sequence model, using the Neural Machine Translation (NMT) system as the test bed. One of the remarkable features of the current NMT system is that it produces translations of the right length, whereas phrase-based machine translation has to rely on specifically designed features and heavy beam search to achieve the same goal. By plotting the activations of all the neurons at each step of a simple 4-dimension seq2seq model, I find that one neuron is responsible for length control. When switching to real-world large-scale NMT models, I fit the word position in the sequence with all possible neuron values at each step, and similarly, I locate the specific neurons which are responsible for length control.

Next, I switch to investigating higher-level properties of the sequence, and ask two questions of the NMT system: Does the encoder learn syntactic information about the source sentence? What kind of syntactic information is learned, and how much? First, I propose to use beam search to find the subset of neurons whose linear combinations can best predict external syntactic labels at both the word level and the sentence level. Second, to measure how much syntactic information the whole encoder encodes, I extract the whole constituency tree of the source sentence from the NMT encoding vectors using a retrained linearized-tree decoder. Further analysis provides a positive answer to the first question and also shows that different layers capture different types of syntax with different degrees of concentration.

1.3 Battle against the Huge Vocabulary Set

Beam search is the standard search algorithm to find the best sequence E according to the probability P(E|F) estimated by the seq2seq model. At each step during beam search, the softmax function is calculated, as shown in Equation 1.8. Decoding can be time consuming: 1) Most tasks generate target sentences in an on-line fashion, one token at a time, unlike the batch mode used during training; 2) Decoding time is proportional to the beam size B, which is usually around 10 in machine translation and 50 in poetry generation; 3) The vocabulary size |V| can be very large (tens of thousands), which makes the softmax calculation the computational bottleneck.

Multiple approaches have been proposed to speed up beam search for RNN-based generation tasks. The first line of research is to use specialized hardware, like the Tensor Processing Unit (TPU), and low precision (Low-p) calculation (Wu et al., 2016).
This method will usually speed up all parts of neural models. The second line tries to compress the original large model into a small model by weight pruning (WP) (See et al., 2016) or sequence-level knowledge distillation (KD) (Kim and Rush, 2016). These methods require additional fine-tuning. The third line is to modify the softmax layer to speed up decoding. Noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) discriminates between the gold target word and k (k << |V|) other sampled words. It has been successfully applied on several NLP tasks (Mnih and Teh, 2012; Vaswani et al., 2013; Williams et al., 2015; Zoph et al., 2016a). Morin and Bengio (2005) introduce hierarchical softmax (H-softmax), where log2 |V| binary classifications are performed rather than a single |V|-way classification. However, these two methods can only speed up training and still suffer at the decoding phase. Chen et al. (2016) propose differentiated softmax (D-softmax), based on the idea that more parameters should be assigned to the embeddings of frequent words and fewer to rare words. It can achieve speedups on both training and decoding.

In this dissertation, I aim to speed up decoding on GPU by shrinking the run-time target vocabulary size, and this approach is orthogonal to the methods above. It is important to note that approaches 1 and 2 will maintain or even increase the ratio of target word embedding parameters to the total parameters, thus the Beam Expansion and Softmax steps will occupy the same or a greater portion of the decoding time. A small run-time vocabulary will dramatically reduce the time spent on these two portions and gain a further speedup even after applying other speedup methods.

The first algorithm is to use word alignments to select a very small number of candidate target words given the source sentence. Experiments on 4 NMT systems show that this algorithm can reduce the softmax computation time by 20 fold and achieve a 2-3x overall speedup without BLEU loss.

However, this approach is only suitable for tasks where sensible alignments can be extracted, such as machine translation and summarization, and does not benefit tasks like image captioning or poem generation. Thus, I propose to use Locality Sensitive Hashing (LSH) to shrink the run-time vocabulary. This is a machine learning free (ML-free) method, which means it can be used in plug-and-play style, without requiring additional tuning processes or alignment information, once the original model is done training. We re-designed the LSH algorithm for beam search on GPU by fully considering the underlying architecture of CUDA-enabled GPUs: 1) A parallel Cuckoo hash table is applied for LSH code lookup (guaranteed O(1) lookup time); 2) Candidate lists are shared across beams to maximize parallelism; 3) Top frequent words are merged into the candidate lists to improve performance. Experiments on the same 4 NMT systems demonstrate that this algorithm can achieve a 2x overall speedup without hurting BLEU.
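To illustrate the candidate-selection idea behind the word-alignment method at a high level (the full method and experiments appear in Chapter 4), the following minimal Python sketch builds a run-time vocabulary as the union of per-source-word candidate lists and a frequent-word list; the alignment table here is a hypothetical toy dictionary, whereas in practice it would be estimated from the training bitext by a word aligner:

def runtime_vocabulary(source_sentence, alignment_table, frequent_words, max_per_word=50):
    # start from a list of frequent target words, then add aligned candidates per source word
    vocab = set(frequent_words)
    for f in source_sentence:
        vocab.update(alignment_table.get(f, [])[:max_per_word])
    return sorted(vocab)

# Toy usage with made-up entries.
alignment_table = {"chat": ["cat", "chat"], "noir": ["black", "dark"]}
frequent_words = ["the", "a", "is", "<EOS>"]
print(runtime_vocabulary(["le", "chat", "noir"], alignment_table, frequent_words))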
1.4 Guide the Generation by External Knowledge

In many text generation tasks, there exist certain constraints that the output sequence should always follow. For example, if we treat constituency tree parsing as a generation problem, the generated output should always form a valid bracketed tree that has the same number of leaves as the number of words in the input sentence. In the poem generation task, each line should follow a certain meter and different lines should rhyme.

Recurrent Neural Networks' generation can follow the constraints when two conditions are met: 1) adequate training data are presented and 2) the constraints are well exposed in the training data. Vinyals et al. (2015) utilize 8 million sentences and corresponding linearized bracket trees to train a seq2seq model for the parsing task. They report that only 14 out of 1700 generated trees are malformed. However, those two conditions are not easy to satisfy for certain tasks, such as poem generation. There are not enough existing poems to train a good seq2seq model. Furthermore, the meter or rhyme is encoded in the pronunciation of the word, whereas in a naive seq2seq model each word is treated as merely an integer, so no pronunciation information is exposed to the RNN. One approach is to synthesize additional artificial training data which satisfy the constraints, and hope the RNN model can pick them up. In this dissertation, we take another approach: hybridize the RNN with another computing device that can easily impose the constraints during the decoding phase. Specifically, we integrate a Weighted Finite State Acceptor (WFSA) into the RNN's beam search, and successfully apply it to poem generation. Both the rhyme information and the number of syllables of each word are encoded in a giant FSA, which guides the beam search to follow iambic pentameter, whereas the RNN acts as a Language Model (LM) to keep the generation fluent. In this way, we provide a flexible way to constrain the RNN's output.

1.5 Quench the Data Thirst

Neural networks' extraordinary expressive power comes at a price: a large amount of training data is needed to prevent overfitting. However, the price can sometimes be too high, e.g., in machine translation, collecting millions of parallel sentence pairs for rare languages with no more than 1 million native speakers. Compared with traditional statistical machine translation, there is a threshold for the number of parallel sentences above which NMT is superior to SMT and below which NMT is inferior.

A number of reasons could explain the deficit of NMT when trained on a small corpus: First, the vocabularies on the source side and target side can only be derived from the training data. The smaller the training data, the larger the chance that certain words from the test set are out of the vocabulary (OOV). The problem becomes more serious for morphologically rich languages, as they contain more inflections. Second, due to their huge number of parameters, neural machine translation models are known for their superior expressive power. At the same time, they overfit more easily on smaller datasets, thus degrading the translation quality. Third, besides the out-of-vocabulary words, there is still a big portion of words that appear very few times. During training, the corresponding parameters of these rare words are barely updated after initialization. This is a general problem for NMT, but when dealing with low-resource languages and morphologically rich languages, the problem deteriorates. Fourth, experiments show that NMT tends to generate shorter translations when trained on a smaller corpus. It drops certain words of the source sentences, thus leading to inadequate translations.

In this dissertation, we will re-examine multiple technologies that are used in high-resource language NMT and other NLP tasks, explore their variations, and stack them together to build a strong NMT system for low-resource languages.
1.6 Contributions

This dissertation aims to provide a better interpretation for neural sequence models, and to augment them with better decoding speed, flexible external constraints and higher data efficiency. The contributions of this dissertation are:

1. Novel discoveries and interpretation of LSTM-based Neural Machine Translation systems (sequence-to-sequence model) and Language Model (sequence model). In the encoder-decoder neural machine translation system, we find that a collection of hidden units learns to explicitly generate target strings of appropriate lengths (Shi et al., 2016a). We also verify that NMT systems learn source language syntax as a by-product of training on string pairs, and both local and global syntactic information about source sentences is captured by the encoder. Different types of syntax are stored in different layers, with different degrees of concentration (Shi et al., 2016b).

2. Novel algorithms to control the output of neural sequence models, which lead to faster decoding speed and a flexible way to constrain the outputs. By using word-alignment information and Locality Sensitive Hashing to shrink the target-side run-time vocabulary, we gain a 2-3x speed-up on GPUs without hurting machine translation accuracy, on 4 language pairs (Shi and Knight, 2017). By integrating the Finite State Acceptor (FSA) with beam search during decoding, we successfully control the meter and style in poem generation (Ghazvininejad et al., 2016, 2017).

3. Novel algorithms to improve Neural Machine Translation systems on low-resource data. By applying byte pair encoding on the source side, regularizing the NMT with recurrent dropout and layer normalization, improving rare word selection with hidden state normalization, and tuning translation with length and coverage penalties, we achieve a 10.4 BLEU score improvement over the vanilla NMT system on the Uyghur-English dataset, which is comparable to the performance of syntax-based machine translation.

4. A practical toolkit built on top of Zoph_RNN, which enables decoding with word alignment, locality sensitive hashing and integration of a finite state acceptor.

Chapter 2
Interpretation: Length Control Mechanism in Sequence-to-sequence Model

Starting from this chapter, we will gradually open the black box of the seq2seq model. In this chapter, we begin the interpretation by investigating the length control mechanism of the seq2seq model: how the encoder-decoder translation systems output target strings of appropriate lengths. We start with a toy problem of sequence copying, and locate a single hidden unit acting as a counter of how many symbols have already been encoded/decoded. Then we switch to a real-world large-scale machine translation system and, similarly, find a collection of hidden units that learns to explicitly control the output length.

2.1 Introduction

The neural encoder-decoder framework for machine translation (Neco and Forcada, 1997; Castaño and Casacuberta, 1997; Sutskever et al., 2014b; Bahdanau et al., 2014; Luong et al., 2015) provides new tools for addressing the field's difficult challenges. In this framework (Figure 2.1), we use a recurrent neural network (encoder) to convert a source sentence into a dense, fixed-length vector.

Figure 2.1: The encoder-decoder framework for neural machine translation (NMT) (Sutskever et al., 2014b). Here, a source sentence C B A (fed in reverse as A B C) is translated into a target sentence W X Y Z.
At each step, an evolving real-valued vector summarizes the state of the encoder (left half) and decoder (right half).

We then use another recurrent network (decoder) to convert that vector into a target sentence. In this chapter, we train long short-term memory (LSTM) neural units (Hochreiter and Schmidhuber, 1997a) trained with back-propagation through time (Werbos, 1990).

A remarkable feature of this simple neural machine translation (NMT) model is that it produces translations of the right length. When we evaluate the system on previously unseen test data, using BLEU (Papineni et al., 2002a), we consistently find the length ratio between MT outputs and human reference translations to be very close to 1.0. Thus, no brevity penalty is incurred. This behavior seems to come for free, without special design.

By contrast, builders of standard statistical machine translation (SMT) systems must work hard to ensure correct length. The original mechanism comes from the IBM SMT group, whose famous Models 1-5 included a learned table ε(y|x), with x and y being the lengths of source and target sentences (Brown et al., 1993). But they did not deploy this table when decoding a foreign sentence f into an English sentence e; it did not participate in incremental scoring and pruning of candidate translations. As a result (Brown et al., 1995):

"However, for a given f, if the goal is to discover the most probable e, then the product P(e) P(f|e) is too small for long English strings as compared with short ones. As a result, short English strings are improperly favored over longer English strings. This tendency is counteracted in part by the following modification: Replace P(f|e) with c^length(e) P(f|e) for some empirically chosen constant c. This modification is treatment of the symptom rather than treatment of the disease itself, but it offers some temporary relief. The cure lies in better modeling."

More temporary relief came from Minimum Error-Rate Training (MERT) (Och, 2003), which automatically sets c to maximize BLEU score. MERT also sets weights for the language model P(e), translation model P(f|e), and other features. The length feature combines so sensitively with other features that MERT frequently returns to it as it revises one weight at a time.

NMT's ability to correctly model length is remarkable for these reasons:

- SMT relies on maximum BLEU training to obtain a length ratio that is prized by BLEU, while NMT obtains the same result through generic maximum likelihood training.
- Standard SMT models explicitly "cross off" source words and phrases as they are translated, so it is clear when an SMT decoder has finished translating a sentence. NMT systems lack this explicit mechanism.
- SMT decoding involves heavy search, so if one MT output path delivers an infelicitous ending, another path can be used. NMT decoding explores far fewer hypotheses, using a tight beam without recombination.

In this chapter, we investigate how length regulation works in NMT.

2.2 A Toy Problem for Neural MT

We start with a simple problem in which source strings are composed of symbols a and b. The goal of the translator is simply to copy those strings. Training cases look like this:

a a a b b a <EOS>  →  a a a b b a <EOS>
b b a <EOS>  →  b b a <EOS>
a b a b a b a a <EOS>  →  a b a b a b a a <EOS>
b b a b b a b b a <EOS>  →  b b a b b a b b a <EOS>

The encoder must summarize the content of any source string into a fixed-length vector, so that the decoder can then reconstruct it.
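Such training cases are easy to synthesize; the following minimal Python sketch generates them in the form used in this section (random strings over {a, b}, with the target an exact copy of the source), with the counts matching the setup described below:

import random

random.seed(0)

def make_pairs(n=2500, max_len=9):
    pairs = []
    for _ in range(n):
        length = random.randint(1, max_len)
        s = " ".join([random.choice("ab") for _ in range(length)] + ["<EOS>"])
        pairs.append((s, s))                # target is an exact copy of the source
    return pairs

for src, tgt in make_pairs()[:3]:
    print(src, "->", tgt)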
With 4 hidden LSTM units, our NMT system can learn to solve this problem after being trained on 2500 randomly chosen strings of lengths up to 9.

To understand how the learned system works, we encode different strings and record the resulting LSTM cell values. Because our LSTM has four hidden units, each string winds up at some point in four-dimensional space. We plot the first two dimensions (unit 1 and unit 2) in the left part of Figure 2.2, and we plot the other two dimensions (unit 3 and unit 4) in the right part. There is no dimension reduction in these plots. Here is what we learn:

- unit 1 records the approximate length of the string. Encoding a string of length 7 may generate a value of -6.99 for unit 1.
- unit 2 records the number of b's minus the number of a's, thus assigning a more positive value to b-heavy strings. It also includes a +1 bonus if the string ends with a.
- unit 3 records a prefix of the string. If its value is less than 1.0, the string starts with b. Otherwise, it records the number of leading a's.
- unit 4 has a more diffuse function. If its value is positive, then the string consists of all b's (with a possible final a). Otherwise, its value correlates with both negative length and the preponderance of b's.

[1] We follow Sutskever et al. (2014b) in feeding the input string backwards to the encoder.
[2] Additional training details: 100 epochs, 100 minibatch size, 0.7 learning rate, 1.0 gradient clipping threshold.
[3] We use the toolkit: https://github.com/isi-nlp/Zoph_RNN

Figure 2.2: After learning, the recurrent network can convert any string of a's and b's into a 4-dimensional vector. The left plot shows the encoded strings in dimensions described by the cell states of LSTM unit 1 (x-axis) and unit 2 (y-axis). unit 1 learns to record the length of the string, while unit 2 records whether there are more b's than a's, with a +1 bonus for strings that end in a. The right plot shows the cell states of LSTM unit 3 (x-axis) and unit 4 (y-axis). unit 3 records how many a's the string begins with, while unit 4 correlates with both length and the preponderance of b's. Some text labels are omitted for clarity.

Figure 2.3: The progression of LSTM state as the recurrent network encodes the string "a b a b b b". Columns show the inputs over time and rows show the outputs. Red color indicates positive values, and blue color indicates negative. The value of unit 1 decreases during the encoding phase (top figure) and increases during the decoding phase (middle figure).
The bottom figure shows the decoder's probability of ending the target string (<EOS>).

For our purposes, unit 1 is the interesting one. Figure 2.3 shows the progression of "a b a b b b" as it gets encoded (top figure), then decoded (bottom two figures). During encoding, the value of unit 1 decreases by approximately 1.0 each time a letter is read. During decoding, its value increases each time a letter is written. When it reaches zero, it signals the decoder to output <EOS>.

The behavior of unit 1 shows that the translator incorporates explicit length regulation. It also explains two interesting phenomena:

- When asked to transduce previously-unseen strings up to length 14, the system occasionally makes a mistake, mixing up an a or b. However, the output length is never wrong.[4]
- When we ask the system to transduce very long strings, beyond what it has been trained on, its output length may be slightly off. For example, it transduces a string of 28 b's into a string of 27 b's. This is because unit 1 is not incremented and decremented by exactly 1.0.

[4] Machine translation researchers have also noticed that when the translation is completely wrong, the length is still correct (anonymous).

2.3 Full-Scale Neural Machine Translation

Next we turn to full-scale NMT. We train on data from the WMT 2014 English-to-French task, consisting of 12,075,604 sentence pairs, with 303,873,236 tokens on the English side, and 348,196,030 on the French side. We use 1000 hidden LSTM units. We also use two layers of LSTM units between source and target.[5]

After the LSTM encoder-decoder is trained, we send test-set English strings through the encoder portion. Every time a word token is consumed, we record the LSTM cell values and the length of the string so far. Over 143,379 token observations, we investigate how the LSTM encoder tracks length.

[5] Additional training details: 8 epochs, 128 minibatch size, 0.35 learning rate, 5.0 gradient clipping threshold.

With 1000 hidden units, it is difficult to build and inspect a heat map analogous to Figure 2.3. Instead, we seek to predict string length from the cell values, using a weighted, linear combination of the 1000 LSTM cell values. We use the least-squares method to find the best predictive weights, with resulting R² values of 0.990 (for the first layer, closer to source text) and 0.981 (second layer). So the entire network records length very accurately. However, unlike in the toy problem, no single unit tracks length perfectly. The best unit in the second layer is unit 109, which correlates with R² = 0.894.

We therefore employ three mechanisms to locate a subset of units responsible for tracking length. We select the top k units according to: (1) individual R² scores, (2) greedy search, which repeatedly adds the unit which maximizes the set's R² value, and (3) beam search. Table 2.1 shows the different subsets we obtain. These are quite predictive of length. Table 2.2 shows how R² increases as beam search augments the subset of units.

  Top 10 units by...   1st layer   2nd layer
  Individual R²        0.868       0.947
  Greedy addition      0.968       0.957
  Beam search          0.969       0.958
Table 2.1: R² values showing how differently-chosen sets of 10 LSTM hidden units correlate with length in the NMT encoder.
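The form of this probe is simple; the following minimal Python/numpy sketch fits the least-squares weights and computes R². The cell states below are random placeholders standing in for the recorded encoder states, so the resulting R² is meaningless here; the actual probe is run over the 143,379 recorded observations:

import numpy as np

rng = np.random.default_rng(0)
n_obs, n_units = 5000, 1000                       # toy sizes; the real probe uses 143,379 observations
states = rng.normal(size=(n_obs, n_units))        # recorded LSTM cell values (random stand-ins)
lengths = rng.integers(1, 50, size=n_obs)         # length of the string so far

X = np.hstack([states, np.ones((n_obs, 1))])      # add a bias column
w, *_ = np.linalg.lstsq(X, lengths, rcond=None)   # least-squares weights

pred = X @ w
ss_res = ((lengths - pred) ** 2).sum()
ss_tot = ((lengths - lengths.mean()) ** 2).sum()
print("R^2 =", 1.0 - ss_res / ss_tot)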
It also shows (lower part) how 39 k Best subset of LSTM's 1000 units R 2 1 109 0.894 2 334, 109 0.936 3 334, 442, 109 0.942 4 334, 442, 109, 53 0.947 5 334, 442, 109, 53, 46 0.951 6 334, 442, 109, 53, 46, 928 0.953 7 334, 442, 109, 53, 46, 433, 663 0.955 Table 2.2: Sets of k units chosen by beam search to optimally track length in the NMT encoder. These units are from the LSTM's second layer. 0 5 10 15 20 25 0 5 10 15 20 25 30 20 15 10 5 0 5 Encoding Decoding Unit 109 Unit 334 log P(<EOS>) Figure 2.4: Activation of translation unit 109 and unit 334 during the encoding and decoding of a sample sentence. Also shown is the softmax log-prob of output <EOS>. the probability of target word <EOS> shoots up once the correct target length has been achieved. MT decoding is trickier, because source and target strings are not necessarily the same length, and target length depends on the words chosen. Figure 2.4 shows the action of unit 109 and unit 334 for a sample sentence. They behave similarly on this sentence, but not identically. These two units do not form a simple switch that 40 controls length|rather, they are high-level features computed from lower/previous states that contribute quantitatively to the decision to end the sentence. Figure 2.4 also shows the log P(<EOS>) curve, where we note that the proba- bility of outputting <EOS> rises sharply (from 10 8 to 10 4 to 0:998), rather than gradually. 2.5 Conclusion We determine how target length is regulated in NMT decoding and nd that a collection of hidden units learns to explicitly implement this functionality. 41 Chapter 3 Interpretation: Encoder Has Grasped Syntactic Information The previous chapter began our interpretation of seq2seq models by investigat- ing a simple ability of the RNN to control its output sequence length. In this chapter, we challenge ourselves by asking a more complex question, whether the neural translation system learns syntactic information on the source side as a by-product of training. Traditional statistical machine translation evolves from phrase-based system to syntax-based system, so it is natural to ask whether neu- ral machine translation systems have already embedded syntax-related information into its mysterious hidden representation. The investigation of syntax is a chal- lenge because syntax information contains not only local, categorical features like part of speech (POS), but also global, structural patterns like dependency head or even the whole constituency tree. Thus, beyond the simple linear regression applied in previous chapter, we develop a new method based on transfer learning to testify those global structure syntax patterns. 3.1 Introduction The sequence to sequence model (seq2seq) has been successfully applied to neural machine translation (NMT) (Sutskever et al., 2014b; Cho et al., 2014) and can match or surpass MT state-of-art. Non-neural machine translation systems consist 42 chie y of phrase-based systems (Koehn et al., 2003) and syntax-based systems (Galley et al., 2004, 2006; DeNeefe et al., 2007; Liu et al., 2011; Cowan et al., 2006), the latter of which adds syntactic information to source side (tree-to-string), target side (string-to-tree) or both sides (tree-to-tree). As the seq2seq model rst encodes the source sentence into a high-dimensional vector, then decodes into a target sentence, it is hard to understand and interpret what is going on inside such a procedure. Considering the evolution of non-neural translation systems, it is natural to ask: 1. 
Does the encoder learn syntactic information about the source sentence? 2. What kind of syntactic information is learned, and how much? 3. Is it useful to augment the encoder with additional syntactic information? In this chapter, we focus on the rst two questions and propose two methods: We create various syntactic labels of the source sentence and try to pre- dict these syntactic labels with logistic regression, using the learned sentence encoding vectors (for sentence-level labels) or learned word-by-word hidden vectors (for word-level label). We nd that the encoder captures both global and local syntactic information of the source sentence, and dierent infor- mation tends to be stored at dierent layers. We extract the whole constituency tree of source sentence from the NMT encoding vectors using a retrained linearized-tree decoder. A deep analysis on these parse trees indicates that much syntactic information is learned, while various types of syntactic information are still missing. 43 Model Accuracy Majority Class 82.8 English to French (E2F) 92.8 English to English (E2E) 82.7 Table 3.1: Voice (active/passive) prediction accuracy using the encoding vector of an NMT system. The majority class baseline always chooses active. 3.2 Example As a simple example, we train an English-French NMT system on 110M tokens of bilingual data (English side). We then take 10K separate English sentences and label their voice as active or passive. We use the learned NMT encoder to convert these sentences into 10k corresponding 1000-dimension encoding vectors. We use 9000 sentences to train a logistic regression model to predict voice using the encoding cell states, and test on the other 1000 sentences. We achieve 92.8% accuracy (Table 3.2), far above the majority class baseline (82.8%). This means that in reducing the source sentence to a xed-length vector, the NMT system has decided to store the voice of English sentences in an easily accessible way. When we carry out the same experiment on an English-English (auto-encoder) system, we nd that English voice information is no longer easily accessed from the encoding vector. We can only predict it with 82.7% accuracy, no better than chance. Thus, in learning to reproduce input English sentences, the seq2seq model decides to use the xed-length encoding vector for other purposes. 3.3 Related work Interpreting Recurrent Neural Networks. The most popular method to visu- alize high-dimensional vectors, such as word embeddings, is to project them into two-dimensional space using t-SNE (van der Maaten and Hinton, 2008). Very few 44 works try to interpret recurrent neural networks in NLP. Karpathy et al. (2016) use a character-level LSTM language model as a test-bed and nd several activation cells that track long-distance dependencies, such as line lengths and quotes. They also conduct an error analysis of the predictions. Li et al. (2016) explore the syntac- tic behavior of an RNN-based sentiment analyzer, including the compositionality of negation, intensication, and concessive clauses, by plotting a 60-dimensional heat map of hidden unit values. They also introduce a rst-order derivative based method to measure each unit's contribution to the nal decision. Verifying syntactic/semantic properties. Several works try to build a good distributional representation of sentences or paragraph (Socher et al., 2013; Kalch- brenner et al., 2014; Kim, 2014; Zhao et al., 2015; Le and Mikolov, 2014; Kiros et al., 2015). 
They implicitly verify the claimed syntactic/semantic properties of learned representations by applying them to downstream classication tasks such as senti- ment analysis, sentence classication, semantic relatedness, paraphrase detection, image-sentence ranking, question-type classication, etc. Novel contributions of our work include: We locate a subset of activation cells that are responsible for certain syntac- tic labels. We explore the concentration and layer distribution of dierent syntactic labels. We extract whole parse trees from NMT encoding vectors in order to analyze syntactic properties directly and thoroughly. Our methods are suitable for large scale models. The models in this chapter are 2-layer 1000-dimensional LSTM seq2seq models. 45 Model Target Language Input vocabulary size Output vocabulary size Train/Dev/Test Corpora Sizes (sentence pairs) BLEU E2E English 200K 40K 4M/3000/2737 89.11 PE2PE Permuted English 200K 40K 4M/3000/2737 88.84 E2F French 200K 40K 4M/6003/3003 24.59 E2G German 200K 40K 4M/3000/2737 12.60 E2P Linearized constituency tree 200K 121 8162K/1700/2416 n/a Table 3.2: Model settings and test-set BLEU-n4r1 scores (Papineni et al., 2002b). 3.4 Datasets and models We train two NMT models, English-French (E2F) and English-German (E2G). To answer whether these translation models' encoders to learn store syntactic information, and how much, we employ two benchmark models: An upper-bound model, in which the encoder learns quite a lot of syntactic information. For the upper bound, we train a neural parser that learns to \translate" an English sentence to its linearized constitutional tree (E2P), following Vinyals et al. (2015). An lower-bound model, in which the encoder learns much less syntactic infor- mation. For the lower bound, we train two sentence auto-encoders: one translates an English sentence to itself (E2E), while the other translates a permuted English sentence to itself (PE2PE). We already had an indica- tion above (Section 3.2) that a copying model does not necessarily need to remember a sentence's syntactic structure. Figure 3.1 shows sample inputs and outputs of the E2E, PE2PE, E2F, E2G, and E2P models. 46 Figure 3.1: Sample inputs and outputs of the E2E, PE2PE, E2F, E2G, and E2P models. We use English-French and English-German data from WMT2014 (Bojar et al., 2014). We take 4M English sentences from the English-German data to train E2E and PE2PE. For the neural parser (E2P), we construct the training corpus follow- ing the recipe of Vinyals et al. (2015). We collect 162K training sentences from publicly available treebanks, including Sections 0-22 of the Wall Street Journal Penn Treebank (Marcus et al., 1993), Ontonotes version 5 (Pradhan and Xue, 2009) and the English Web Treebank (Petrov and McDonald, 2012). In addition to these gold treebanks, we take 4M English sentences from English-German data and 4M English sentences from English-French data, and we parse these 8M sen- tences with the Charniak-Johnson parser 1 (Charniak and Johnson, 2005). We call these 8,162K pairs the CJ corpus. We use WSJ Section 22 as our development set 1 The CJ parser is here https://github.com/BLLIP/bllip-parser and we used the pretrained model \WSJ+Gigaword-v2". 47 Parser WSJ 23 F1-score # valid trees (out of 2416) CJ Parser 92.1 2416 E2P 89.6 2362 (Vinyals et al., 2015) 90.5 unk Table 3.3: Labeled F1-scores of dierent parsers on WSJ Section 23. The F1-score is calculated on valid trees only. 
and section 23 as the test set, where we obtain an F1-score of 89.6, competitive with the previously-published 90.5 (Table 3.4). Model Architecture. For all experiments 2 , we use a two-layer encoder- decoder with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997b). We use a minibatch of 128, a hidden state size of 1000, and a dropout rate of 0.2. For auto-encoders and translation models, we train 8 epochs. The learning rate is initially set as 0.35 and starts to halve after 6 epochs. For E2P model, we train 15 epochs. The learning rate is initialized as 0.35 and starts to decay by 0.7 once the perplexity on a development set starts to increase. All parameters are re-scaled when the global norm is larger than 5. All models are non-attentional, because we want the encoding vector to summarize the whole source sentence. Table 3.4 shows the settings of each model and reports the BLEU scores. 3.5 Syntactic Label Prediction 3.5.1 Experimental Setup In this section, we test whether dierent seq2seq systems learn to encode syntactic information about the source (English) sentence. 2 We use the toolkit: https://github.com/isi-nlp/Zoph RNN 48 With 1000 hidden states, it is impractical to investigate each unit one by one or draw a heat map of the whole vector. Instead, we use the hidden states to predict syntactic labels of source sentences via logistic regression. For multi-class prediction, we use a one-vs-rest mechanism. Furthermore, to identify a subset of units responsible for certain syntactic labels, we use the recursive feature elimina- tion (RFE) strategy: the logistic regression is rst trained using all 1000 hidden states, after which we recursively prune those units whose weights' absolute values are smallest. We extract three sentence-level syntactic labels: 1. Voice: active or passive. 2. Tense: past or non-past. 3. TSS: Top level syntactic sequence of the constituent tree. We use the most frequent 19 sequences (\NP-VP", \PP-NP-VP", etc.) and label the remain- der as \Other". and two word-level syntactic labels: 1. POS: Part-of-speech tags for each word. 2. SPC: The smallest phrase constituent that above each word. Both voice and tense labels are generated using rule-based systems based on the constituent tree of the sentence. Figure 3.2 provides examples of our ve syntactic labels. When predicting these syntactic labels using corresponding cell states, we split the dataset into training and test sets. Table 3.4 shows statistics of each labels. For a source sentence s, 49 Label Train Test Number of classes Most frequent label Voice 9000 1000 2 Active Tense 9000 1000 2 Non-past TSS 9000 1000 20 NP-VP POS 87366 9317 45 NN SPC 81292 8706 24 NP Table 3.4: Corpus statistics for ve syntactic labels. Figure 3.2: The ve syntactic labels for sentence \This time , the rms were ready". 
s = [w 1 ;:::;w i ;:::;w n ] the two-layer encoder will generate an array of cell vectors c during encoding, c = [(c 1;0 ;c 1;1 );:::; (c i;0 ;c i;1 );:::; (c n;0 ;c n;1 )] We extract a sentence-level syntactic labelL s , and predict it using the encoding cell states that will be fed into the decoder: L s =g(c n;0 ) or L s =g(c n;1 ) 50 E2P E2F E2G E2E PE2PE 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Voice E2P E2F E2G E2E PE2PE 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Accuracy Tense E2P E2F E2G E2E PE2PE 0.5 0.6 0.7 0.8 0.9 1.0 Accuracy TSS E2P E2F E2G E2E PE2PE 0.2 0.4 0.6 0.8 1.0 Accuracy POS E2P E2F E2G E2E PE2PE 0.5 0.6 0.7 0.8 0.9 1.0 Accuracy SPC Majority Class C0 All C0 Top10 C1 All C1 Top10 Figure 3.3: Prediction accuracy of ve syntactic labels on test. Each syntactic label is predicted using both the lower-layer cell states (C0) and higher-layer cell states (C1). For each cell state, we predict each syntactic label using all 1000 units (All), as well as the top 10 units (Top10) selected by recursive feature elimination. The horizontal blue line is the majority class accuracy. where g() is the logistic regression. Similarly, for extracting word-level syntactic labels: L w = [L w1 ;:::;L wi ;:::;L wn ] we predict each labelL wi using the cell states immediately after encoding the word w i : L wi =g(c i;0 ) or L Wi =g(c i;1 ) 51 3.5.2 Result Analysis Test-set prediction accuracy is shown in Figure 3.3. For voice and tense, the prediction accuracy of two auto-encoders is almost same as the accuracy of majority class, indicating that their encoders do not learn to record this information. By contrast, both the neural parser and the NMT systems achieve approximately 95% accuracy. When predicting the top-level syntactic sequence (TSS) of the whole sentence, the Part-of-Speech tags (POS), and smallest phrase constituent (SPC) for each word, all ve models achieve an accuracy higher than that of majority class, but there is still a large gap between the accuracy of NMT systems and auto-encoders. These observations indicate that the NMT encoder learns signicant sentence-level syntactic information|it can distinguish voice and tense of the source sentence, and it knows the sentence's structure to some extent. At the word level, the NMT's encoder also tends to cluster together the words that have similar POS and SPC labels. Dierent syntactic information tends to be stored at dierent layers in the NMT models. For word-level syntactic labels, POS and SPC, the accuracy of the lower layer's cell states (C0) is higher than that of the upper level (C1). For the sentence-level labels, especially tense, the accuracy of C1 is larger than C0. This suggests that the local features are somehow preserved in the lower layer whereas more global, abstract information tends to be stored in the upper layer. For two-classes labels, such as voice and tense, the accuracy gap between all units and top-10 units is small. For other labels, where we use a one-versus-rest strategy, the gap between all units and top-10 units is large. However, when predicting POS, the gap of neural parser (E2P) on the lower layer (C0) is much smaller. This comparison indicates that a small subset of units explicitly takes 52 Figure 3.4: E2F and E2F2P share the same English encoder. When training E2F2P, we only update the parameters of linearized tree decoder, keeping the English encoder's parameters xed. charge of POS tags in the neural parser, whereas for NMT, the POS info is more distributed and implicit. 
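The probing recipe of Section 3.5.1, a logistic regression over cell states followed by recursive feature elimination down to the top 10 units, can be sketched with scikit-learn as follows. The arrays here are synthetic stand-ins for the recorded encoder states and labels, and the settings (for example the 10% elimination step) are illustrative choices rather than the exact configuration used in our experiments.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Synthetic stand-ins: 10,000 sentences, the final 1000-dimensional cell state
# of each, and a binary sentence-level label such as voice (active/passive).
# Multi-class labels (TSS, POS, SPC) use the same probe in one-vs-rest mode.
rng = np.random.default_rng(0)
cell_states = rng.normal(size=(10000, 1000))
labels = rng.integers(0, 2, size=10000)

X_train, y_train = cell_states[:9000], labels[:9000]
X_test, y_test = cell_states[9000:], labels[9000:]

# Probe g(.) using all 1000 units.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("all units:", clf.score(X_test, y_test))

# Recursive feature elimination: repeatedly refit and drop the units whose
# weights have the smallest absolute values, until only 10 units remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=0.1)
rfe.fit(X_train, y_train)
print("top-10 units:", np.where(rfe.support_)[0])
print("top-10 accuracy:", rfe.score(X_test, y_test))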
There are no large differences between encoders of E2F and E2G regarding syntactic information.
3.6 Extract Syntactic Trees from Encoder
3.6.1 Experimental Setup
We now turn to whether NMT systems capture deeper syntactic structure as a by-product of learning to translate from English to another language. We do this by predicting full parse trees from the information stored in encoding vectors. Since this is a structured prediction problem, we can no longer use logistic regression. Instead, we extract a constituency parse tree from the encoding vector of a model E2X by using a new neural parser E2X2P with the following steps:
1. Take the E2X encoder as the encoder of the new model E2X2P.
An improvement over the lower bound models indicates that the encoder learns syntactic information, whereas a decline from the upper bound model shows that the encoder loses certain syntactic information. 55 Second, the NMT's encoder maps a plain English sentence into a high-dimensional vector, and our goal is to test whether the projected vectors form a more syntactically- related manifold in the high-dimensional space. In practice, one could also predict parse structure for the E2E in two steps: (1) use E2E's decoder to recover the original English sentence, and (2) parse that sentence with the CJ parser. But in this way, the manifold structure in the high-dimensional space is destroyed during the mapping. Result Analysis Table 3.5 reports perplexity on training and development sets, the labeled F1-score on WSJ Section 23, and the Tree Edit Distance (TED) of various systems. Tree Edit Distance (TED) calculates the minimum-cost sequence of node edit operations (delete, insert, rename) between a gold tree and a test tree. When decoding with beam size 10, the four new neural parsers can generate well-formed trees for almost all the 2416 sentences in the WSJ section 23. This makes TED a robust metric to evaluate the overall performance of each parser. Table 3.5 reports the average TED per sentence. Trees extracted from E2E and PE2PE encoding vectors (via models E2E2P and PE2PE2P, respectively) get TED above 30, whereas the NMT systems get approximately 17 TED. Among the well-formed trees, around half have a mismatch between number of leaves and number of tokens in the source sentence. The labeled F1-score is reported over the rest of the sentences only. Though biased, this still re ects the overall performance: we achieve around 80 F1 with NMT encoding vectors, much higher than with the E2E and PE2PE encoding vectors (below 60). 56 Model Labeled F1 POS Tagging Accuracy PE2PE2P 58.67 54.32 E2E2P 70.91 68.03 E2G2P 85.36 85.30 E2F2P 86.62 87.09 E2P 93.76 96.00 Table 3.6: Labeled F1-scores and POS tagging accuracy on the intersection set of EVALB-trees of dierent parsers. There are 569 trees in the intersection, and the average length of corresponding English sentence is 12.54. Fine-grained Analysis Besides answering whether the NMT encoders learn syntactic information, it is interesting to know what kind of syntactic information is extracted and what is not. As Table 3.5 shows, dierent parsers generate dierent numbers of trees that are acceptable to Treebank evaluation software (\EVALB-trees"), having the correct number of leaves and so forth. We select the intersection set of dierent models' EVALB-trees. We get a total of 569 shared EVALB-trees. The average length of the corresponding sentence is 12.54 and the longest sentence has 40 tokens. The average length of all 2416 sentences in WSJ section 23 is 23.46, and the longest is 67. As we do not apply an attention model for these neural parsers, it is dicult to handle longer sentences. While the intersection set may be biased, it allows us to explore how dierent encoders decide to capture syntax on short sentences. Table 3.6 shows the labeled F1-scores and Part-of-Speech tagging accuracy on the intersection set. The NMT encoder extraction achieves around 86 percent tagging accuracy, far beyond that of the auto-encoder based parser. 
Besides the tagging accuracy, we also utilize the Berkeley Parser Analyzer (Kummerfeld et al., 2012) to gain a more linguistic understanding of predicted 57 Sense Confusion Single Word Phrase Different label Noun boundary error NP Internal Unary Modifier Attach Verb taking wrong arguments PP Attach VP Attach Co-ordination 0.057 0.150 0.137 0.022 0.053 0.123 0.205 0.035 0.242 0.024 0.081 E2P (Ave. Bracket Err) 16.58 2.74 2.52 2.20 2.17 1.98 1.46 1.44 1.44 1.36 1.14 E2F2P (Ratio) 17.77 3.31 2.42 3.20 1.83 2.25 2.03 2.75 1.27 1.27 1.05 E2G2P (Ratio) 23.77 5.01 5.00 3.10 3.17 3.21 1.69 3.50 1.82 4.55 1.78 E2E2P (Ratio) 32.19 5.12 5.26 5.10 3.58 3.71 1.82 2.44 2.27 5.64 0.22 PE2PE2P (Ratio) Figure 3.5: For model E2P (the red bar), we show the average number of bracket errors per sentence due to the top 11 error types. For other models, we show the ratio of each model's average number of bracket errors to that of model E2P . Errors analyzed on the intersection set. The table is sorted based on the ratios of the E2F 2P model. parses. Like TED, the Berkeley Parser Analyzer is based on tree transformation. It repairs the parse tree via a sequence of sub-tree movements, node insertions and deletions. During this process, multiple bracket errors are xed, and it associates this group of node errors with a linguistically meaningful error type. The rst column of Figure 3.5 shows the average number of bracket errors per sentence for model E2P on the intersection set. For other models, we report the ratio of each model to model E2P. Kummerfeld et al. (2013) and Kummerfeld et al. (2012) give descriptions of dierent error types. The NMT-based predicted parses introduce around twice the bracketing errors for the rst 10 error types, whereas for \Sense Confusion", they bring more than 16 times bracket errors. \Sense confusion" is the case where the head word of a phrase receives the wrong POS, resulting in an attachment error. Figure 3.6 shows an example. Even though we can predict 86 percent of parts-of-speech correctly from NMT encoding vectors, the other 14 percent introduce quite a few attachment errors. 58 Figure 3.6: Example of Sense Confusion. The POS tag for word \beyond" is predicted as \RB" instead of \IN", resulting in a missing prepositional phrase. NMT sentence vectors encode a lot of syntax, but they still cannot grasp these subtle details. 59 3.7 Conclusion We investigate whether NMT systems learn source-language syntax as a by-product of training on string pairs. We nd that both local and global syntactic informa- tion about source sentences is captured by the encoder. Dierent types of syntax is stored in dierent layers, with dierent concentration degrees. We also carry out a ne-grained analysis of the constituency trees extracted from the encoder, highlighting what syntactic information is still missing. 60 Chapter 4 Augmentation: Speed Up Decoding using Word Alignment The previous three chapters have gradually opened up the black box of the seq2seq model and provided insightful interpretations of the hidden states and certain inter- nal behaviors. From this chapter, we will start to discuss the challenges during the decoding phase. When a certain seq2seq model is ready and deployed into pro- duction to provide large scale services, one crucial factor for good user experience is decoding speed. In this chapter, we focus on neural machine translation and aim to speed up its decoding by shrinking the run-time target-vocabulary use the word alignment information. 
Using this method, we get a 2x overall speed-up over a highly-optimized GPU implementation on four language pairs, without hurting the BLEU. On certain low-resource language pairs, the same method even improve BLEU by 0.5 points. 4.1 Introduction Neural Machine Translation (NMT) has been demonstrated as an eective model and been put into large-scale production (Wu et al., 2016; He, 2015). For online translation services, decoding speed is a crucial factor to achieve a better user expe- rience. Several recently proposed training methods (Shen et al., 2015; Wiseman and Rush, 2016) aim to solve the exposure bias problem, but require decoding the 61 Sub-module Full vocab WA50 Speedup Total 1002.78 s 481.52 s 2.08 { Beam expansion 174.28 s 76.52 s 2.28 { Source-side 83.67 s 83.44 s 1 { Target-side 743.25 s 354.52 s 2.1 { { Softmax 402.77 s 20.68 s 19.48 { { Attention 123.05 s 123.12 s 1 { { 2nd layer 64.72 s 64.76 s 1 { { 1st layer 88.02 s 87.96 s 1 Shrink vocab - 0.39 s - BLEU 25.16 25.13 - Table 4.1: Time breakdown and BLEU score of full vocabulary decoding and our \WA50" decoding, both with beam size 12. WA50 means decoding informed by word alignments, where each source word can select at most 50 relevant target words. The model is a 2-layer, 1000-hidden dimension, 50,000-target vocabulary LSTM seq2seq model with local attention trained on the ASPEC Japanese-to- English corpus (Nakazawa et al., 2016). The time is measured on a single Nvidia Tesla K20 GPU. whole training set multiple times, which is extremely time-consuming for millions of sentences. Slow decoding speed is partly due to the large target vocabulary size V, which is usually in the tens of thousands. The rst two columns of Table 4.1 show the breakdown of the runtimes required by sub-modules to decode 1812 Japanese sen- tences to English using a sequence-to-sequence model with local attention (Luong et al., 2015). Softmax is the most computationally intensive part, where each hid- den vector h t 2 R d needs to dot-product with V target embeddings e i 2 R d . It occupies 40% of the total decoding time. Another sub-module whose computation time is proportional to V is Beam Expansion, where we need to nd the top B words among all V vocabulary according to their probability. It takes around 17% of the decoding time. Several approaches have proposed to improve decoding speed: 62 1. Using special hardware, such as GPU and Tensor Processing Unit (TPU), and low-precision calculation (Wu et al., 2016). 2. Compressing deep neural models through knowledge distillation and weight pruning (See et al., 2016; Kim and Rush, 2016). 3. Several variants of Softmax have been proposed to solve its poor scaling properties on large vocabularies. Morin and Bengio (2005) propose hierar- chical softmax, where at each step log 2 V binary classications are performed instead of a single classication on a large number of classes. Gutmann and Hyv arinen (2010) propose noise-contrastive estimation which discriminate between positive labels and k (k<<V ) negative labels sampled from a dis- tribution, and is applied successfully on natural language processing tasks (Mnih and Teh, 2012; Vaswani et al., 2013; Williams et al., 2015; Zoph et al., 2016a). Although these two approaches provide good speedups for training, they still suer at test time. Chen et al. (2016) introduces dierentiated softmax, where frequent words have more parameters in the embedding and rare words have less, oering speedups on both training and testing. 
In this chapter, we aim to speed up decoding by shrinking the run-time target vocabulary size, and this approach is orthogonal to the methods above. It is important to note that approaches 1 and 2 will maintain or even increase the ratio of target word embedding parameters to the total parameters, thus the Beam Expansion and Softmax will occupy the same or greater portion of the decoding time. A small run-time vocabulary will dramatically reduce the time spent on these two portions and gain a further speedup even after applying other speedup methods. 63 To shrink the run-time target vocabulary, our rst method uses Locality Sen- sitive Hashing. Vijayanarasimhan et al. (2015) successfully applies it on CPUs and gains speedup on single step prediction tasks such as image classication and video identication. Our second method is to use word alignments to select a very small number of candidate target words given the source sentence. Recent works (Jean et al., 2015; Mi et al., 2016; L'Hostis et al., 2016) apply a similar strategy and report speedups for decoding on CPUs on rich-source language pairs. Our major contributions are: 1. To our best of our knowledge, this chapter is the rst attempt to apply LSH technique on sequence generation tasks on GPU other than single-step classication on CPU. We nd current LSH algorithms have a poor perfor- mance/speed trade-o on GPU, due to the large overhead introduced by many hash table lookups and list-merging involved in LSH. 2. For our word alignment method, we nd that only the candidate list derived from lexical translation table of IBM model 4 is adequate to achieve good BLEU/speedup trade-o for decoding on GPU. There is no need to combine the top frequent words or words from phrase table, as proposed in Mi et al. (2016). 3. We conduct our experiments on GPU and provide a detailed analysis of BLEU/speedup trade-o on both resource-rich/poor language pairs and both attention/non-attention NMT models. We achieve more than 2x speedup on 4 language pairs with only a tiny BLEU drop, demonstrating the robustness and eciency of our methods. 64 4.2 Methods At each step during decoding, the softmax function is calculated as: P (y =jjh i ) = e h T i w j +b j P V k=1 e h T i w k +b k (4.1) whereP (y =jjh i ) is the probability of word j = 1:::V given the hidden vector h i 2R d ;i = 1:::B. B represents the beam size. w j 2R d is output word embedding and b j 2 R is the corresponding bias. The complexity isO(dBV ). To speed up softmax, we use word frequency, locality sensitive hashing, and word alignments respectively to select C (C <<V ) potential words and evaluate their probability only, reducing the complexity toO(dBC +overhead). 4.2.1 Word Frequency A simple baseline to reduce target vocabulary is to select the top C words based on their frequency in the training corpus. There is no run-time overhead and the overall complexity isO(dBC). 4.2.2 Locality Sensitive Hashing The word j = arg max k P (y = kjh i ) will have the largest value of h T i w j +b j . Thus the arg max problem can be converted to nding the nearest neighbor of vector [h i ; 1] among the vectors [w j ;b j ] under the distance measure of dot-product. Locality Sensitive Hashing (LSH) is a powerful technique for the nearest neighbor problem. 
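This reduction is easy to verify numerically. The short numpy sketch below, with random stand-ins of smaller size for h_i, w_j and b_j, checks that the argmax of the logits coincides with the dot-product nearest neighbor of [h_i; 1] among the augmented vectors [w_j; b_j].

import numpy as np

# Random stand-ins (smaller than the real model) for the hidden vector h_i,
# the output embeddings w_j, and the biases b_j.
rng = np.random.default_rng(0)
d, V = 1000, 10000
h = rng.normal(size=d)
W = rng.normal(size=(V, d))     # row j is w_j
b = rng.normal(size=V)

# argmax of the softmax = argmax of the logits h^T w_j + b_j ...
best_by_logit = int(np.argmax(W @ h + b))

# ... = dot-product nearest neighbor of [h; 1] among the vectors [w_j; b_j].
h_aug = np.concatenate([h, [1.0]])
W_aug = np.hstack([W, b[:, None]])
best_by_neighbor = int(np.argmax(W_aug @ h_aug))

assert best_by_logit == best_by_neighbor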
We employ the winner-take-all (WTA) hashing (Yagnik et al., 2011) dened as: 65 WTA(x2R d ) = [I 1 ;:::;I p ;:::;I P ] (4.2) I p = arg max K k=1 Permute p (x)[k] (4.3) WTA band (x) = [B 1 ;:::;B w ;:::;B W ] (4.4) B w = [I (w1)u+1 ;:::;I (w1)u+i ;:::;I wu ] (4.5) u =P=W (4.6) where P distinct permutations are applied and the index of the maximum value of the rst K elements of each permutations is recorded. To perform approxi- mate nearest neighbor searching, we follow the scheme used in (Dean et al., 2013; Vijayanarasimhan et al., 2015): 1. Split the hash codeWTA(x) intoW bands (as shown in equation 5.6), with each band P W log 2 (K) bits long. 2. Create W hash tables [T 1 ;:::;T w ;:::;T W ], and hash every word index j into every table T w using WTA band (w j )[w] as the key. 3. Given the hidden vectorh i , extract a list of word indexes from each tableT w using the key WTA band (h i )[w]. Then we merge the W lists and count the number of the occurrences of each word index. Select the topC word indexes with the largest counts, and calculate their probability using equation 5.1. The 4 hyper-parameters that dene a WTA-LSH arefK;P;W;Cg. The run- time overhead comprises hashing the hidden vector, W times hash table lookups andW lists merging. The overall complexity isO(B(dC +KP +W +WN avg ))), where N avg is the average number of the word indexes stored in a hash bin of T w . 66 Although the complexity is much smaller thanO(dBV ), the runtime in practice is not guaranteed to be shorter, especially on GPUs, as hash table lookups introduce too many small kernel launches and list merging is hard to parallelize. 4.2.3 Word Alignment Intuitively, LSH shrinks the search space utilizing the spatial relationship between the query vector and database vectors in high dimension space. It is a task- independent technique. However, when focusing on our specic task (MT), we can employ translation-related heuristics to prune the run-time vocabulary precisely and eciently. One simple heuristic relies on the fact that each source word can only be trans- lated to a small set of target words. The word alignment model, a foundation of phrase-base machine translation, also follows the same spirit in its generative story: each source word is translated to zero, one, or more target words and then reordered to form target sentences. Thus, we apply the following algorithm to reduce the run-time vocabulary size: 1. Apply IBM Model 4 and the grow-diag-nal heuristic on the training data to get word alignments. Calculate the lexical translation table P(ejf) based on word alignments. 2. For each word f in the source vocabulary of the neural machine translation model, store the top M target words according to P(ejf) in a hash table T f2e =ff : [e 1 ;:::e M ]g 3. Given a source sentence s = [f 1 ;:::;f N ], extract the candidate target word list fromT f2e for each source wordf i . Merge theN lists to form the reduced target vocabulary V new ; 67 4. Construct the new embedding matrix and bias vector according toV new , then perform the normal beam search on target side. The only hyper-parameter isfMg, the number of candidate target words for each source word. Given a source sentence of length L s , the run-time overhead includes L s times hash table lookups and L s lists merging. The complexity for each decoding step isO(dBjV new j + (L s +L s M)=L t ), where L t is the maximum number of decoding steps. 
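As an illustration of steps 2-4 above, the sketch below builds the table T_f2e from a lexical translation table and constructs V_new once per source sentence. It is a minimal Python rendering of the procedure, not our GPU implementation; the example words and probabilities are invented, and in the real pipeline the table comes from IBM Model 4 alignments with the grow-diag-final heuristic (step 1).

# Made-up lexical translation table P(e|f).
t_table = {
    "chat": {"cat": 0.80, "chat": 0.10, "kitty": 0.05},
    "noir": {"black": 0.70, "dark": 0.20},
    "le":   {"the": 0.90, "him": 0.05},
}
M = 2   # keep at most M candidate target words per source word

# Step 2: for each source word f, store the top-M target words by P(e|f).
T_f2e = {f: [e for e, _ in sorted(p.items(), key=lambda kv: -kv[1])[:M]]
         for f, p in t_table.items()}

def runtime_vocab(source_sentence, target_word_to_id):
    """Steps 3-4: union the per-word candidate lists (once per sentence) and
    map them to target-vocabulary ids; the decoder then builds the reduced
    embedding matrix and bias vector from these ids before beam search."""
    V_new = set()
    for f in source_sentence:
        V_new.update(T_f2e.get(f, []))
    return sorted(target_word_to_id[e] for e in V_new if e in target_word_to_id)

target_word_to_id = {w: i for i, w in
                     enumerate(["the", "cat", "black", "dark", "him", "kitty"])}
print(runtime_vocab("le chat noir".split(), target_word_to_id))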
Unlike LSH, these table lookups and list mergings are performed once per sentence, and do not depend on the any hidden vectors. Thus, we can overlap the computation with source side forward propagation. 4.3 Experiments To examine the robustness of these decoding methods, we vary experiment settings in dierent ways: 1) We train both attention (Luong et al., 2015) and non-attention (Sutskever et al., 2014b) models; 2) We train models on both resource-rich language pairs, French to English (F2E) and Japanese to English (J2E), and a resource- poor language pair, Uzbek to English (U2E); 3) We translate both to English (F2E, J2E, and U2E) and from English (E2J). We use 2-layer LSTM seq2seq models with dierent attention settings, hidden dimension sizes, dropout rates, and initial learning rates, as shown in Table 4.3. We use the ASPEC Japanese- English Corpus (Nakazawa et al., 2016), French-English Corpus from WMT2014 (Bojar et al., 2014), and Uzbek-English Corpus (Consortium, 2016). Table 4.2 shows the decoding results of the three methods. Decoding with word alignments achieves the best performance/speedup trade-o across all four translation directions. It can halve the overall decoding time with less than 0.17 BLEU drop. Table 4.1 compares the detailed time breakdown of full-vocabulary 68 J2E E2J F2E U2E TC BLEU X TC BLEU X TC BLEU X TC BLEU X Full 0.87 25.16 1 0.95 33.87 1 0.84 28.12 1 0.9 11.67 1 TF1K 0.14 13.42 2.11 0.15 18.91 2.42 0.1 12.1 2.32 0.29 8.78 1.65 TF5K 0.49 21.31 1.93 0.56 29.77 2.23 0.38 21.98 2.04 0.67 11.54 1.51 TF10K 0.67 23.62 1.76 0.75 32.28 2.04 0.56 24.88 1.78 0.81 11.67 1.33 TF20K 0.78 24.61 1.48 0.87 33.41 1.74 0.72 26.95 1.42 0.89 11.66 1.09 LSH1K - 19.45 0.026 - 22.23 0.027 - 3.43 0.036 - 9.41 0.025 LSH5K - 23.43 0.023 - 30.63 0.025 - 12.81 0.031 - 11.41 0.022 LSH10K - 24.82 0.022 - 32.63 0.024 - 18.45 0.028 - 11.63 0.020 LSH20K - 25.20 0.020 - 33.78 0.022 - 24.31 0.025 - 11.73 0.018 WA10 0.75 24.74 2.12 0.77 33.24 2.46 0.72 27.9 2.37 0.66 12.17 1.7 WA50 0.82 25.13 2.08 0.85 33.79 2.43 0.77 27.94 2.34 0.71 12.01 1.67 WA250 0.84 25.13 1.89 0.88 34.05 2.27 0.8 27.95 2.1 0.73 11.94 1.62 WA1000 0.85 25.17 1.57 0.9 33.97 1.93 0.82 28.08 1.67 0.75 11.89 1.58 Table 4.2: Word type coverage (TC), BLEU score, and speedups (X) for full- vocabulary decoding (Full), top frequency vocabulary decoding (TF*), LSH decod- ing (LSH*), and decoding with word alignments (WA*). TF10K represents decod- ing with top 10,000 frequent target vocabulary (C = 10; 000). WA10 means decod- ing with word alignments, where each source word can select at most 10 candi- date target words (M = 10). For LSH decoding, we choose (32, 5000, 1000) for (K,P ,W ), and vary C. decoding and WA50 decoding. WA50 can gain a speedup of 19.48x and 2.28x on softmax and beam expansion respectively, leading to an overall 2.08x speedup with only 0.03 BLEU drop. In contrast, decoding with top frequent words will hurt the BLEU rapidly as the speedup goes higher. We calculate the word type coverage (TC) for the test reference data as follows: TC = jfrun-time vocabg\fword types in testgj jfword types in testgj The top 1000 words only cover 14% word types of J2E test data, whereas WA10 covers 75%, whose run-time vocabulary is no more than 200 for a 20 words source sentence. 
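The coverage metric above is straightforward to compute; the following sketch, with made-up tokens, shows the calculation.

def type_coverage(runtime_vocab, reference_tokens):
    """TC: fraction of word types in the test references that also appear
    in the run-time vocabulary."""
    reference_types = set(reference_tokens)
    return len(set(runtime_vocab) & reference_types) / len(reference_types)

# Toy example: 3 of the 4 reference word types are covered.
print(type_coverage(["the", "cat", "sat", "dog"],
                    ["the", "cat", "sat", "down"]))   # 0.75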
69 J2E E2J F2E U2E Source Vocab 80K 88K 200K 50K Target Vocab 50K 66K 40K 25K #Tokens 70.4M 70.4M 652M 3.3M #Sent pairs 1.4M 1.4M 12M 88.7K Attention Yes Yes No Yes Dimension 1000 1000 1000 500 Dropout 0.2 0.2 0.2 0.5 Learning rate 0.5 1 0.35 0.5 Table 4.3: Training congurations on dierent language pairs. The speedup of English-to-Uzbek translation is relatively low (around 1.7x). This is because the original full vocabulary size is small (25k), leaving less room for shrinkage. LSH achieves better BLEU than decoding with top frequent words of the same run-time vocabulary size C on attention models. However, it introduces too large an overhead (50 times slower), especially when softmax is highly optimized on GPU. When doing sequential beam search, search error accumulates rapidly. To reach reasonable performance, we have to apply an adequately large number of permutations (P = 5000). We also nd that decoding with word alignments can even improve BLEU on resource-poor languages (12.17 vs. 11.67). Our conjecture is that rare words are not trained enough, so neural models confuse them, and word alignments can provide a hard constraint to rule out the unreasonable word choices. 4.4 Conclusion We apply word alignments to shrink run-time vocabulary to speed up neural machine translation decoding on GPUs, and achieve more than 2x speedup on 4 translation directions without hurting BLEU. We also compare with two other 70 speedup methods: decoding with top frequent words and decoding with LSH. Experiments and analyses demonstrate that word alignments provides accurate candidate target words and introduces only a tiny overhead over a highly-optimized GPU implementation. 71 Chapter 5 Augmentation: Speed Up Decoding using Locality Sensitive Hashing In the last chapter, we successfully utilized the word alignment information to shrink the run-time vocabulary, leading to a 2x over-all speedups. However, this method is only eective in applications when word alignment is available, whereas in applications such as image caption or poem generation, no such concept of word alignment exists. Thus, in this chapter, we use locality sensitive hashing (LSH), a machine-learning free method, to shrink the run-time vocabulary. With LSH, the small candidate word list is selected only using the spatial relationship between current hidden vector and all word embeddings. It requires no additional word alignment or ne-tuning, and can be used in plug-and-play style on any trained sequence model. The previous chapter tried vanilla Winner-take-all LSH on GPU, but only got a negative result, due to the large overhead on GPU. In this chapter, we redesign the WTA-LSH algorithm by fully considering the underly architecture of CUDA-enabled GPUs. Experiments on 4 large-scale neural machine translation models demonstrate that our algorithm can achieve 2x overall speedup without hurting BLEU on GPU. 1 1 The decoding speed of the baseline in this chapter is generally faster than that of previous chapter because we further optimized the decoding phase of our toolkit. 72 5.1 Introduction Beam search has been widely applied as the decoding technique of choice for Recur- rent Neural Network (RNN) based text generation tasks, such as machine trans- lation (Wu et al., 2016), summarization (Rush et al., 2015), image captioning (Xu et al., 2015) and poetry generation (Ghazvininejad et al., 2016). 
Decoding can be time consuming: 1) Most tasks generate target sentences in an on-line fashion, one token at a time, unlike batch mode used during training; 2) Decoding time is proportional to the beam size B, which is usually around 10 in machine trans- lation and 50 in poetry generation; 3) Vocabulary size V can be very large (tens of thousands), and the computational complexity of the two major components, Softmax and Beam expansion, are proportional to V . As Table 5.1 shows, these two parts occupy 46% and 30% of the decoding time respectively. The major bottleneck for Softmax is the matrix multiplication between hidden states H 2 R Bd and the word embedding matrix E 2 R djVj . As for Beam expansion, tremendous time is spent on transferring the output of Softmax P 2 R BjVj from device memory to host memory and the following heap-sort on CPU, with complexityO(log(B)jVj). This chapter aims to speed up the beam search by reducing the runtime vocab- ulary size, using Locality Sensitive Hashing (Gionis et al., 1999). We rst hash each high-dimensional word embedding into dierent buckets so that the embed- dings which are closer under a certain distance measure will be in the same bucket with high probability. We choose the winner-take-all (WTA) hash (Yagnik et al., 2011) due to its robustness against the perturbations of numerical value and corre- lation with dot-product distance. Then during decoding, we construct a candidate word listV LSH by retrieving the words that share the same bucket with the hidden 73 stateH under the same hash function. Finally an actual dot-product is calculated between H and the shrunken word embedding matrix E LSH 2R djV LSH j . Device Percent Full Vocab LSH Speedup Total GPU+CPU 100 % 1178.5 s 574.3 s 2.05 Source side GPU 7 % 88 s 88.1 s 1.00 Target side GPU 63 % 735.5 s 387.2 s 1.90 { Softmax GPU 43 % 505.3 s 157 s 3.22 { 2nd layer GPU 10 % 113.7 s 113.7 s 1.00 { 1st layer GPU 10 % 115.2 s 115.2 s 1.00 Beam Expansion GPU+CPU 30 % 352.4 s 96.4 s 3.66 { Device2Host data transfer GPU+CPU 12 % 138.1 s 25.8 s 5.35 { Heapsort CPU 15 % 176 s 31.9 s 5.52 { Hidden states reorder GPU 3 % 38.3 s 38.7 s 0.99 Runtime vocab size - - 40,000 5,792 6.91 BLEU - - 28.12 27.81 - Table 5.1: Time breakdown, runtime vocabulary size, and BLEU score of full vocabulary decoding and LSH decoding. The model is a 2-layer, 1000-hidden dimension, 40000 target vocabulary LSTM seq2seq model trained on a French to English corpus (Bojar et al., 2014). The experiments are conducted on a Nvidia K20 GPU and a single-core 2.4GHz Intel Xeon CPU. The code is compiled against CUDA 8.0. Vijayanarasimhan et al. (2015) successfully applies this idea to speed up both training and inference of deep neural networks with large output space on multi- core CPU. However, beam search on GPU poses several unique and hard challenges: 1. The LSH schema used by Vijayanarasimhan et al. (2015) is not GPU-friendly: a) It uses a hash table on CPU to store the bucket key and word list, and the underlying data structure is usually a balanced binary search tree or linked list, both of which are hard to transport to GPU. b) It requires sorting to get the candidate lists, which can not easily parallelize on GPU. c) It processes each hidden vector in the batch one by one, whereas the matrix dot-product 74 on GPU is calculated across the whole batch to fully take advantage of the GPU parallelism. 2. 
Beam search generally requires high recall of the top words according to the actual dot-product, because the error will accumulate fast as the sentence is built up. Whereas in practice, we nd LSH alone does not delivery an adequate recall/speedup trade-o. Our main contribution is to re-design the LSH algorithm on GPU for beam search to address the above-mentioned challenges. After fully considering the com- putational capabilities of CUDA-enabled GPU, we propose the following algorithm for LSH: 1. We implement a parallel Cuckoo hash table (Pagh and Rodler, 2004) for LSH code lookup on GPU. The Cuckoo hash table can achieve worst-case constant-time lookup, oering an excellent load balance that is important for GPU parallelism. 2. The candidate word list is shared across dierent beams so that GPU can calculate the actual matrix multiplication between H and E LSH in a batch mode. We also use a threshold to select the candidate list to avoid the expensive sorting operation. 3. To solve the low-recall problem, we always merge the top frequent words into the candidate list. We conduct experiments on the task of neural machine translation (NMT). We train NMT models on 4 language pairs, and our LSH algorithm can achieve a consistent 2x overall speedup over the full vocabulary decoding with less than 0.4 BLEU drop. 75 5.2 Related Work Several approaches have been proposed to speed up beam search for RNN-based generation tasks. The rst line of research is to use specialized hardware, like Tensor Processing Unit (TPU) and low precision (Low-p) calculation (Wu et al., 2016). This method will usually speedup all parts of the neural models. The second line tries to compress the original large model to a small model by weight pruning (WP) (See et al., 2016) or sequence-level knowledge distillation (KD) (Kim and Rush, 2016). These methods require additional ne-tuning. The third line is to modify the Softmax layer to speed up the decoding. Noise- contrastive estimation (NCE) (Gutmann and Hyv arinen, 2010) discriminates between the gold target word and k (k <<jVj) other sampled words. It has been suc- cessfully applied on several NLP tasks (Mnih and Teh, 2012; Vaswani et al., 2013; Williams et al., 2015; Zoph et al., 2016a). Morin and Bengio (2005) introduces hier- archical softmax (H-softmax) where log 2 jVj binary classications are performed rather than a singlejVj-way classication. However, these two methods can only speedup training and still suer at the decoding phase. Chen et al. (2016) propose dierentiated softmax (D-softmax) based on the idea that more parameters should be assigned for embeddings of frequent words and fewer for rare words. It can achieve speedups on both training and decoding. The fourth line of research uses word alignments (WA) trained on the parallel corpus to construct a small runtime vocabulary for each sentence (Jean et al., 2015; Mi et al., 2016; L'Hostis et al., 2016; Shi and Knight, 2017). However, this approach is only suitable for tasks where sensible alignments can be extracted, such as machine translation and summarization, and do not benet tasks like image caption or poem generation. 76 Speedup Train Speedup Decode ML Free TPU X X X Low-p X X X WP X KD X NCE X n/a D-softmax X X X H-softmax X n/a WA X X LSH X X Table 5.2: Comparison of speedup methods. Table 5.2 compares dierent speed up methods. Compared to these existing methods, LSH has the following advantages: 1. It is orthogonal to the rst two lines of research. 
The rst two lines of approaches do not decrease the ratio of the number of word embedding parameters to the number of the rest parameters. Thus, LSH can be applied for further speedup. 2. It is a machine learning free (ML-free) method, which means it can be used in plug-and-play style, without requiring additional tuning processes or align- ment information, once the original model is done training. 5.3 GPU Computing Concept In this section, we describe the basic concepts of CUDA-enabled GPU computing to better motivate our decisions in re-designing the LSH algorithm. 77 5.3.1 Warp Kernels are functions executed on GPU. Each kernel will be executed by many GPU threads in parallel. These threads are further grouped into warps, where each warp consists of 32 threads. All 32 threads in a wrap will execute the same instruction concurrently. However, due to the branches inside the code, some threads in a warp will be diverged and the rest of the threads in the same wrap will have to idle in that cycle. We should make every eort to avoid warp divergence to maximize GPU usage. Figure 5.1 shows an example of warp divergence. Figure 5.1: Illustration of kernel, warp and warp divergence. The solid line means the thread is active, and the dashed line means the thread is idle. Because of the branch, the rst half of the warp will execute instruction A and be idle when the other half executes instruction B. 5.3.2 Memory Hierarchy The bandwidth of GPU global memory is 208 GB/s for Tesla K20 GPU, and it takes 400-800 cycles for global memory access. Another faster but limited memory is shared memory, whose bandwidth is more than 2 TB/s and only takes 3-4 cycles for each access. The way to access the global memory also strongly aects the speed. Coalesced access, where threads in the same wrap will access consecutive address, can take 78 the full bandwidth, i.e. around 200 GB/s. Whereas the bandwidth of random access can be as low as 20 GB/s. In practice, we will load data from global memory in a coalesced way to shared memory, then manipulate the data on shared memory, and nally write back to global memory in a coalesced way again. 5.3.3 Latency Hiding The global memory access can lead to a large latency (400-800 cycles). However, the GPU scheduler has a smart strategy to hide the latency: when a warp needs to access global memory, GPU will put it into wait, and switch to another warp to execute. In order to hide the latency completely, each kernel should launch enough threads (more than 1000) in parallel. 5.4 Locality Sensitive Hashing At each step during beam search with beam size B, given the hidden state H2 R Bd from the top RNN layer, the probability distribution P2R BjVj over V will be calculated by softmax: P [i;j] =p(y =jjH i ) = e Logit[i;j] P V k=1 e Logit[i;k] (5.1) Logit =HE (5.2) where H i is the ith row of H and E2R djVj is the word embedding matrix. The computational intensive part is the matrix product in 5.2, whose complexity is O(dBjVj). 79 Softmax LSH Softmax on CPU LSH Softmax on GPU (ours) WTA-hash Top K collision in hash table Matrix multiply Normalization for i = 1 .. B Matrix multiply Normalization WTA-hash Cuckoo Lookup Construct candidate list Construct candidate embedding matrix Matrix multiply Normalization Figure 5.2: Comparison of the pipeline of full vocabulary softmax, LSH softmax on CPU proposed in Vijayanarasimhan et al. (2015), and our LSH softmax on GPU. 
Every step of full vocabulary softmax and our LSH softmax on GPU is executed in batch mode, whereas the steps inside the grey box of LSH softmax on CPU are executed separately for each hidden vector in the beam. For each beam entry, we are only interested in the topB words according to the probability/logit. We can reduce the complexity down toO(dBV 0 )(jV 0 jjVj) if we can inexpensively construct a much smaller vocabulary setV 0 that also contains the top B words. p(y = jjH i ) is proportional to the dot-product between H i and E j . Thus, nding the topB words with highest probability is equivalent to nding the nearest neighbors ofH i from all embedding vectorsE k ;8k = 1:::jVj under the dot-product distance measure. LSH (Gionis et al., 1999) is an ecient tool for the nearest neighbor problem. LSH will construct a small candidate vocabulary set V LSH which will contains the top B words with a high expectation. 5.4.1 LSH on CPU Vijayanarasimhan et al. (2015) successfully utilize winner-take-all (WTA) LSH to speed up the softmax calculation on CPU: 80 1. Hash every word embedding E j 2R d into hash codeWTA(E j )2Z W where W is the dimension of the hash space. Organize these hash codes in hash tables. 2. For each hidden vector H i 2 R d , apply the same hash function to get WTA(H i ). 3. Given WTA(H i ), select the top K collisions in the hash table, to construct the candidate vocabulary set V LSH;i . 4. Calculate the dot-product H i E k ;8k2V LSH;i . 5. Repeat step 2-4 for each entry in the batch. The tasks Vijayanarasimhan et al. (2015) that sped up are Image Classication, Skipgram Word2Vec, and Video Identication, which all involve only one step of the softmax calculation. When conducting inference in batch mode, the test entries inside a batch can be very dierent, thus the candidate list V LSH;i will dier a lot. Therefore, step 2-4 must be executed independently for each entry in the batch. This will lead to less speedup when batch size is large: Vijayanarasimhan et al. (2015) reports that speedup decreases from 6.9x to 1.6x when batch size increase from 8 to 64. This is problematic as beam size could be more than 50 for certain task. 5.4.2 LSH on GPU A similar reduction in speedup will also happen on GPU, especially because GPUs prefer large-scale calculation to hide latency, as described in Section 5.3.3. Table 5.3 shows a comparison of matrix multiplications at dierent scales. Even though cal- culation is reduced to 1/12th, the time spent is only shrunk to one third. 81 (m,k,n): A m;k B k;n Time (ms) G op/s (12,1000,50000) 5.12 234.58 (1,1000,50000) 1.63 61.27 Table 5.3: The time consumption and oating point operations per second(G op/s) of matrix multiplication on GPU at dierent scales. On GPU, another drawback of processing each beam one by one, is that we must rst construct embedding matrix E LSH;i that occupies continuous space in global memory to do matrix multiplication. This expensive data movement can further downgrade the speedups. To solve this problem, we propose to share the candidate vocabulary setV LSH = [ B i=1 V LSH;i across dierent entries in the batch. Only one embedding matrixE LSH will be constructed and the following matrix multiplication HE LSH will be cal- culated in batch mode in a single kernel launch. This idea is motivated by the intuition that during beam search, at each step, dierent V LSH;i will have a big portion of words in common, thusjV LSH j < P B i=1 jV LSH;i j. 
Although the amount of computation in the matrix multiplication increases (dB|V_LSH| > \sum_{i=1}^{B} d|V_{LSH,i}|), sharing the candidate set lets every step of our LSH algorithm on GPU run in batch mode, which saves a lot of time in practice.

Another issue posed by beam search is that errors accumulate quickly as the sentence is built, so missing a correct word at one step can be catastrophic. In practice, we find that even the combined candidate list V_LSH can miss some important words. We address this by further merging the top T most frequent words into the candidate list:

    V_LSH = V_T \cup ( \bigcup_{i=1}^{B} V_{LSH,i} )    (5.3)

Step                         | Full Vocab (ms) | Percent | LSH (ms) | Percent | Naive slowdown
Softmax                      | 120.97          | 100.0%  | 44.09    | 100.0%  | -
  LSH overhead               | -               |         | 16.53    | 37.5%   | -
    WTA-hash                 | -               |         | 3.72     | 8.4%    | -
    Cuckoo lookup            | -               |         | 7.29     | 16.5%   | 3.4x
    Construct candidate list | -               |         | 2.51     | 5.7%    | 1.7x
    Construct E_LSH          | -               |         | 3.01     | 6.8%    | -
  Matrix multiply            | 108.16          | 89.4%   | 22.43    | 50.9%   | -
  Normalization              | 12.33           | 10.2%   | 2.74     | 6.2%    | -
Runtime vocab size           | 40,000          |         | 6,177    |         |

Table 5.4: The runtime vocabulary size and time breakdown of each step of full-vocabulary decoding and our LSH decoding when translating a French sentence to an English sentence with beam size 12. The last column shows the slowdown when the corresponding optimized step is replaced by a naive implementation.

Figure 5.2 illustrates the detailed pipeline of our LSH algorithm on GPU, and Table 5.4 shows the time breakdown of each step in LSH softmax. Although every step runs in batch mode, we still need to design each step carefully, as a naive implementation leads to large overhead on GPU: the naive implementations of the cuckoo lookup and candidate-list construction suffer 3.4x and 1.7x slowdowns, respectively, compared to our optimized versions.

Winner-Take-All hashing

Following Vijayanarasimhan et al. (2015), we use the winner-take-all (WTA) hashing function (Yagnik et al., 2011), which is formally defined by the following equations:

    WTA(H \in R^d) = [I_1, ..., I_p, ..., I_P]    (5.4)
    I_p = \arg\max_{k=1..K} Permute_p(H)[k]    (5.5)

WTA converts a d-dimensional real-valued hidden vector into a P-dimensional integer-valued hash code. Each I_p is the index of the maximum value among the first K elements of the permuted H. The hash code is therefore an ordinal embedding of the original hidden vector, and we use ordinal similarity as a proxy for dot-product similarity. We can further group these hash codes into bands and convert them into a W-dimensional band code:

    WTA_band(H) = [B_1, ..., B_w, ..., B_W]    (5.6)
    B_w = [I_{(w-1)u+1}, ..., I_{(w-1)u+i}, ..., I_{wu}]    (5.7)
    u = P / W    (5.8)

where each band code B_w is the concatenation of u hash codes. Table 5.5 shows an example of WTA hash, and a minimal sketch of this hashing scheme appears below. We can represent B_w in u*log2(K) bits, and we make sure u*log2(K) < 31 so that each band code can be stored in an int32 on GPU. The hyper-parameters of WTA hash are {K, u, W}.

H         | 0.32 | 0.48 | -0.57 | 0.63
Permute_0 | 1    | 2    | 4     | 3
Permute_1 | 1    | 3    | 2     | 4
Permute_2 | 3    | 2    | 4     | 1
Permute_3 | 4    | 1    | 2     | 3
I_1       | 1 -> (00)_2
I_2       | 2 -> (01)_2
I_3       | 2 -> (01)_2
I_4       | 1 -> (00)_2
B_1       | (0001)_2
B_2       | (0100)_2

Table 5.5: A running example of WTA hash with W = 2, u = 2 and K = 2.

Figure 5.3: Example of cuckoo lookup. The beam size is 1, W = 2 and |V| = 6.
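For concreteness, here is a minimal Python sketch of the WTA band hashing defined in Equations 5.4-5.8. The permutations, random seed and toy values are assumptions for illustration and are not the ones used in the actual system.

```python
import numpy as np

def wta_band_hash(x, permutations, K, u):
    """Sketch of WTA band hashing (Equations 5.4-5.8).

    x:            (d,) real-valued vector (hidden state or word embedding).
    permutations: (P, d) array; each row is one permutation of 0..d-1.
    K:            only the first K elements of each permuted vector compete.
    u:            number of WTA codes packed into one band.
    Returns a list of W = P/u integer band codes.
    """
    P = permutations.shape[0]
    # Equation 5.5: index of the max among the first K permuted elements.
    codes = [int(np.argmax(x[perm[:K]])) for perm in permutations]
    bits = int(np.ceil(np.log2(K)))          # bits needed per WTA code
    bands = []
    for w in range(P // u):                  # Equations 5.6-5.8
        band = 0
        for c in codes[w * u:(w + 1) * u]:   # pack u codes into one integer
            band = (band << bits) | c
        bands.append(band)
    return bands

# Toy usage mirroring Table 5.5's scale: d=4, P=4, K=2, u=2 -> W=2 bands.
rng = np.random.default_rng(0)
d, P, K, u = 4, 4, 2, 2
perms = np.stack([rng.permutation(d) for _ in range(P)])
h = np.array([0.32, 0.48, -0.57, 0.63])
print(wta_band_hash(h, perms, K, u))
# Two vectors collide in a band exactly when their band codes are equal.
```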
Cuckoo lookup

Given WTA_band(H) \in Z^{B \times W}, this step calculates the hit matrix L \in Z^{B \times |V|}, where

    L[i,j] = \sum_{w=1}^{W} I( WTA_band(H_i)[w] = WTA_band(E_j)[w] )    (5.9)

L[i,j] counts how many band codes WTA_band(H_i) and WTA_band(E_j) have in common, which acts as an estimate of the dot-product similarity between H_i and E_j.

First, we hash all word embeddings to obtain WTA_band(E_j). For each band, we build a hash table

    T_w = { band code : [word id_1, ..., word id_n] }

where the key is a band code and the value is the list of all words whose band code equals the key. We re-organize T_w into a flat array on GPU: word ids with the same band code are stored in a contiguous span, and we build a cuckoo hash table for each T_w that stores the starting position and length of each span:

    CuckooT_w = { band code : (start, length) }

Second, we launch a total of B*W GPU threads to calculate L, and each thread follows Algorithm 2. Looking up a key in a cuckoo hash table hashes and compares the key at most twice, so the warp (a group of 32 threads) does not diverge at line 2. However, at line 3, because different threads have different length values, the execution time of the warp depends on the largest length, and there is serious warp divergence at lines 3-6. To solve this problem, we re-arrange the execution order so that the 32 threads first process the for-loop of thread 0 together, then the for-loop of thread 1 together, and so on up to thread 31. This re-arrangement speeds up the step by 3.4x. Figure 5.4 illustrates the two different thread arrangements.

Algorithm 2: Cuckoo lookup
Inputs: T, CuckooT, WTA_band(H), beam index i, band index w
Output: L
1: code = WTA_band(H_i)[w]
2: start, length = CuckooT_w[code]
3: for pos = start to start + length do
4:     word_id = T_w[pos]
5:     L[i, word_id] += 1
6: end for

Figure 5.4: Illustration of the naive and optimized implementations of lines 3-6 in Algorithm 2. We assume each warp contains 4 threads, and their for-loop lengths are 1, 7, 3, and 2. The round grey rectangle represents one step of a warp. (a) The naive implementation, which takes the warp 7 steps to finish. (b) The optimized implementation, which takes only 5 steps.

Construct candidate list

Given the hit matrix L \in Z^{B \times |V|} and a threshold t, this step selects the final candidate vocabulary set V_LSH, where

    j \in V_LSH  <=>  \exists i  s.t.  L[i,j] >= t    (5.10)

We use a threshold to avoid inefficient sorting on GPU. After filtering with t, L is a sparse matrix, whereas V_LSH should be a dense array. This is the canonical stream compaction problem, and one could simply use the copy_if function provided by the Thrust library. To further improve efficiency, we re-design the algorithm to take advantage of shared memory and coalesced access. The new algorithm is illustrated in Figure 5.5, and a small CPU-side sketch of the hit-matrix and candidate-list logic is given below.

Figure 5.5: Illustration of the optimized stream compaction algorithm. We assume each warp contains 4 threads. Two warps first load L into shared memory with a coalesced read. Then only the first thread of each warp scans its 4 values and filters out the valid word IDs. Each warp then writes the valid word IDs back with a coalesced write. The start position in V_LSH for each warp is maintained in global memory, omitted here.

The hyper-parameters that define a WTA LSH beam search are {K, u, W, B, T, t}, where B is the beam size, T is the number of top frequent words to merge, and t is the threshold used to select V_LSH.
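For reference, here is a minimal CPU-side Python sketch of the hit matrix and threshold-based candidate selection of Equations 5.9-5.10. It uses ordinary Python dictionaries where the GPU implementation uses flat arrays and cuckoo hash tables, and the toy band codes only mirror the scale of Figure 5.3; they are otherwise invented for illustration.

```python
import numpy as np
from collections import defaultdict

def build_band_tables(band_codes_E):
    """band_codes_E: (|V|, W) integer band codes for every word embedding.
    Returns one dict per band: band code -> list of word ids (the role the
    flat array plus cuckoo table play on the GPU)."""
    V, W = band_codes_E.shape
    tables = [defaultdict(list) for _ in range(W)]
    for j in range(V):
        for w in range(W):
            tables[w][band_codes_E[j, w]].append(j)
    return tables

def candidate_list(band_codes_H, tables, V, t, top_T=0):
    """Equations 5.9-5.10: count band collisions per (beam, word), then keep
    any word whose count reaches threshold t for at least one beam entry,
    merged with the top_T most frequent words (assumed to be ids 0..top_T-1)."""
    B, W = band_codes_H.shape
    L = np.zeros((B, V), dtype=np.int32)          # hit matrix
    for i in range(B):
        for w in range(W):
            for j in tables[w].get(band_codes_H[i, w], []):
                L[i, j] += 1                      # Algorithm 2, line 5
    hits = np.where((L >= t).any(axis=0))[0]      # stream compaction
    return np.union1d(np.arange(top_T), hits)

# Toy usage: 6 words, W=2 bands, beam size 1, threshold t=1.
codes_E = np.array([[4, 2], [5, 1], [1, 1], [2, 4], [1, 2], [4, 6]])
codes_H = np.array([[4, 1]])
tables = build_band_tables(codes_E)
print(candidate_list(codes_H, tables, V=6, t=1, top_T=0))
```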
5.5 Experiment

We conduct our experiments on 4 machine translation models: Japanese to English (J2E), English to Japanese (E2J), French to English (F2E) and Uzbek to English (U2E). The statistics and training parameters are shown in Table 5.6.

           | J2E   | E2J   | F2E  | U2E
|V_source| | 80K   | 88K   | 200K | 50K
|V_target| | 50K   | 66K   | 40K  | 25K
#Tokens    | 70.4M | 70.4M | 652M | 3.3M
Attention  | Yes   | Yes   | No   | Yes

Table 5.6: Training configurations of the different language pairs. The attention model is based on Luong et al. (2015). Data sources: ASPEC Japanese-English Corpus (Nakazawa et al., 2016), French-English Corpus from WMT2014 (Bojar et al., 2014), and Uzbek-English Corpus (Consortium, 2016).

Overall speedup. We compare our algorithm with two other decoding acceleration methods: decoding using only the top frequent words (TOP) and decoding with word alignments (WA) (Shi and Knight, 2017). We conduct a grid search over the LSH hyper-parameters {K, u, W} and find that {8, 3, 500} and {16, 3, 500} generally deliver a good performance/speedup trade-off.

Figure 5.6 shows the BLEU/speedup curves of the three decoding methods on the 4 translation directions. Our LSH decoding achieves a consistent 2x overall speedup for J2E, E2J and F2E with only a tiny BLEU loss. For U2E, the speedup without BLEU loss is around 1.7x, due to the small original target vocabulary size (25,000). Our LSH decoding always obtains a higher BLEU than TOP decoding by a large margin at the same speedup level. Table 5.4 shows that even the optimized LSH overhead can take up to 37.5% of the total softmax time; to achieve the same speedup as TOP decoding, the runtime vocabulary size of LSH therefore needs to be only half that of TOP decoding, which demonstrates that our LSH algorithm indeed selects a smaller yet more accurate vocabulary set. WA decoding can achieve an even higher speedup at the same BLEU, but it only works in settings where sensible alignment information is available.

Figure 5.6: BLEU/speedup curves of the 3 decoding methods (TOP, WA, LSH) against full softmax on the 4 translation directions.

Effects of beam size. Table 5.7 shows the speedup and BLEU loss of our LSH decoding over full-vocabulary decoding at different beam sizes (batch sizes). Unlike the algorithm proposed by Vijayanarasimhan et al. (2015), our algorithm maintains or even obtains a higher speedup at the same level of BLEU as the beam size increases, which can be explained by the fact that a larger batch size saturates the GPU and fully exploits its parallelism.

Beam size | Speedup | BLEU loss
12        | 2.06    | 0.31
24        | 2.22    | 0.35
36        | 2.23    | 0.28
48        | 2.21    | 0.31

Table 5.7: The speedup and BLEU loss of LSH decoding over full-vocabulary decoding at different beam sizes on French-to-English translation.

Effects of T. Merging the top T frequent words into the candidate word list V_LSH is necessary to obtain good performance. Figure 5.7 shows the BLEU/speedup curves for different T on the French-to-English translation task. Having T too small results in low BLEU, whereas having T too large limits the highest speedup the method can achieve.

Figure 5.7: The BLEU/speedup curves for different T (T = 100, 500, 1000, 2000, 5000, with K=16, u=3, W=500) on the French-to-English translation model.
91 5.6 Conclusion We re-design the LSH algorithm for beam search on GPU. The candidate vocab- ulary set V LSH is shared across the beams to execute every step in batch mode. Several key functions are optimized by using a cuckoo hash table, taking advan- tage of shared memory, and avoiding warp divergence. Top frequent words are merged into V LSH to further improve the performance. Our LSH algorithm is a machine-learning-free acceleration method that achieves 2x speedup on 4 machine translation tasks, and delivers better BLEU/speedup trade-o than TOP decoding. 92 Chapter 6 Augmentation: Constrain the Beam Search with Finite State Acceptor In the previous two chapters, we addressed one of the challenges of decoding: to speed up the beam search by shrinking the runtime target vocabulary. In this chap- ter, we focus on another challenge of decoding: to constrain the beam search with external knowledge, and by manipulating the runtime target vocabulary. Many text generation tasks require the output to follow certain format. For example, to generate a poem, the output text should follow certain meter and rhyme pat- terns. However, the neural sequence model may not learn these constraints, largely because there's no large training data of poems. Thus, we need another assistant model to learn the pronunciation of each English words and keep track of the pen- tameters. In this chapter, we focus on the task of poem generation, and integrate a Finite State Acceptor (FSA) into beam search during decoding, where the RNN is responsible for syntactic and semantic coherence of the output, and FSA keeps track of the pentameter and rhyme patterns. 1 1 Work described in this chapter is mostly joint work with Marjan Ghazvininejad 93 6.1 Introduction Automatic algorithms are starting to generate interesting, creative text, as evi- denced by recent distinguishability tests that ask whether a given story, poem, or song was written by a human or a computer. 2 In this chapter, we describe Hafez, a program that generates any number of distinct poems on a user-supplied topic. Figure 6.1 shows an overview of the system, which sets out these tasks: Vocabulary. We select a specic, large vocabulary of words for use in our generator, and we compute stress patterns for each word. Related words. Given a user-supplied topic, we compute a large set of related words. Rhyme words. From the set of related words, we select pairs of rhyming words to end lines. Finite-state acceptor (FSA). We build an FSA with a path for every conceiv- able sequence of vocabulary words that obeys formal rhythm constraints, with chosen rhyme words in place. Path extraction. We select a uent path through the FSA, using a recurrent neural network (RNN) for scoring. 6.2 Related Work Automated poem generation has been a popular but challenging research topic (Manurung et al., 2000; Gervas, 2001; Diaz-Agudo et al., 2002; Manurung, 2003; 2 For example, in the 2016 Dartmouth test bit.ly/20WGLF3, no automatic sonnet-writing system passed indistinguishability, though ours was selected as the best of the submitted systems. 94 Figure 6.1: Overview of Hafez converting a user-supplied topic word (wedding) into a four-line iambic pentameter stanza. 95 Wong and Chun, 2008; Jiang and Zhou, 2008; Netzer et al., 2009). 
Recent work attempts to solve this problem by applying grammatical and semantic templates (Oliveira, 2009, 2012), or by modeling the task as statistical machine translation, in which each line is a \translation" of the previous line (Zhou et al., 2009; He et al., 2012). Yan et al. (2013) proposes a method based on summarization techniques for poem generation, retrieving candidate sentences from a large corpus of poems based on a user's query and clustering the constituent terms, summarizing each cluster into a line of a poem. Greene et al. (2010) use unsupervised learning to estimate the stress patterns of words in a poetry corpus, then use these in a nite- state network to generate short English love poems. Several deep learning methods have recently been proposed for generating poems. Zhang and Lapata (2014) use an RNN model to generate 4-line Chinese poems. They force the decoder to rhyme the second and fourth lines, trusting the RNN to control rhythm. Yi et al. (2016) also propose an attention-based bidirec- tional RNN model for generating 4-line Chinese poems. The only such work which tries to generate longer poems is from Wang et al. (2016), who use an attention- based LSTM model for generation iambic poems. They train on a small dataset and do not use an explicit system for constraining rhythm and rhyme in the poem. Novel contributions include: We combine nite-state machinery with deep learning, guaranteeing formal correctness of our poems, while gaining coherence of long-distance RNNs. By using words related to the user's topic as rhyme words, we design a system that can generate poems with topical coherence. This allows us to generate longer topical poems. 96 6.3 Selecting Topically Related Rhyme Pairs In this section, I will brie y describe the three steps to select topically related rhyme pairs given any topic word or phrase. Please refer to our work (Ghazvinine- jad et al., 2016) for details, as most technologies involved are out of the scope of this chapter. 6.3.1 Vocabulary To generate a line of iambic pentameter poetry, we arrange words to form a sequence of ten syllables alternating between stressed and unstressed. For example: 010 1 0 10 101 Attending on his golden pilgramage To get stress patterns for individual words, we use CMU pronunciation dic- tionary, 3 . To guarantee that our poems scan properly, we eject all ambiguous words, whose stress patterns depend on context, from our vocabulary. Our nal vocabulary contains 14,368 words (4833 monosyllabic and 9535 multisyllabic). 6.3.2 Related Words and Phrases After we receive a user-supplied topic, the rst step in our poem generation algo- rithm is to build a scored list of 1000 words/phrases that are related to that topic. For example: User-supplied input topic: colonel Output: colonel (1.00), lieutenant colonel (0.77), brigadier general (0.73), commander (0.67) ... army (0.55) ... 3 http://www.speech.cs.cmu.edu/cgi-bin/cmudict 97 To nd the related words/phrases, we train a continuous-bag-of-words word2vec model(Mikolov et al., 2013) 4 with window size 40 and word vector dimension 200 using the rst billion characters from Wikipedia. 5 . We score candidate related words/phrases with cosine to topic-word vector. 6.3.3 Choosing Rhyme Words In a Shakespearean sonnet with rhyme scheme ABAB CDCD EFEF GG, there are seven pairs of rhyme words to decide on. We consider both strict rhyme and slant rhyme pairs. 
We rst hash all related words/phrases into rhyme classes, then any two words/phrases within the same rhyme class will form a candidate rhyme pair(s1, s2), whose score equals maxfcosine(s1;topic);cosine(s2;topic)g. We randomly select 7 rhyme pairs according their score. 6.4 Constructing FSA of Possible Poems After choosing rhyme words, we create a large nite-state acceptor (FSA) that compactly encodes all word sequences that use these rhyme words and also obey formal sonnet constraints: Each sonnet contains 14 lines. Lines are in iambic pentameter, with stress pattern (01) 5 . Following poetic convention, we also use (01) 5 0, allowing feminine rhyming. Each line ends with the chosen rhyme word/phrase for that line. 4 https://code.google.com/archive/p/word2vec/ 5 http://mattmahoney.net/dc/enwik9.zip 98 Figure 6.2: An FSA compactly encoding all word sequences that obey formal sonnet constraints, and dictating the right-hand edge of the poem via rhyming, topical words delight, chance, ... and joy. Each line is punctuated with comma or period, except for the 4th, 8th, 12th, and 14th lines, which are punctuated with period. To implement these constraints, we create FSA states that record line number and syllable count. For example, FSA state L2-S3 (Figure 6.2) signies \I am in line 2, and I have seen 3 syllables so far". From each state, we create arcs for each feasible word in the vocabulary. For example, we can move from state L1-S1 to state L1-S3 by consuming any word with stress pattern 10 (such as table or active). When moving between lines (e.g., from L1-S10 to L2-S1), we employ arcs labeled with punctuation marks. To x the rhyme words at the end of each line, we delete all arcs pointing to the line-nal state, except for the arc labeled with the chosen rhyme word. For speed, we pre-compute the entire FSA; once we receive the topic and choose rhyme words, we only need to carry out the deletion step. 99 In the resulting FSA, each path is formally a sonnet. However, most of the paths through the FSA are meaningless. One FSA generated from the topic natural language contains 10 229 paths, including this randomly-selected one: Of pocket solace ammunition grammar. An tile pretenders spreading logical. An stories Jackie gallon posing banner. An corpses Kato biological ... Hence, we need a way to search and rank this large space. 6.5 Path extraction through FSA with RNN To locate uent paths, we need a scoring function and a search procedure. For example, we can build a n-gram word language model (LM)|itself a large weighted FSA. Then we can take a weighted intersection of our two FSAs and return the highest-scoring path. While this can be done eciently with dynamic program- ming, we nd that n-gram models have a limited attention span, yielding poor poetry. Instead, we use an RNN language model (LM). We collect 94,882 English songs (32m word tokens) as our training corpus, 6 and train 7 a two-layer recurrent net- work with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997b). 8 6 http://www.mldb.org/ 7 We use the toolkit: https://github.com/isi-nlp/Zoph RNN 8 We use a minibatch of 128, a hidden state size of 1000, and a dropout rate of 0.2. The output vocabulary size is 20,000. The learning rate is initially set as 0.7 and starts to decay by 0.83 once the perplexity on a development set starts to increase. All parameters are initialized within range [0:08; +0:08], and the gradients are re-scaled when the global norm is larger than 5. 
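Before describing the FSA-guided search, here is a minimal Python sketch of the finite-state acceptor of Section 6.4, whose states record (line, syllables-so-far) and whose arcs carry words that fit the meter at that position. It assumes a strict (01)^5 meter and omits feminine endings and punctuation arcs; the function name and toy vocabulary are illustrative, not taken from Hafez.

```python
from collections import defaultdict

def build_sonnet_fsa(vocab_stress, rhyme_words, n_lines=14, n_syllables=10):
    """Sketch of the Section 6.4 FSA over states (line, syllables-so-far).

    vocab_stress: dict word -> stress pattern string, e.g. {"table": "10"}.
    rhyme_words:  list of n_lines words fixed at each line end.
    Returns arcs: dict (line, syl) -> list of (word, next_state).
    """
    meter = "01" * (n_syllables // 2)
    arcs = defaultdict(list)
    for line in range(1, n_lines + 1):
        for syl in range(n_syllables):
            for word, stress in vocab_stress.items():
                end = syl + len(stress)
                if end > n_syllables or meter[syl:end] != stress:
                    continue                      # word would break the meter here
                if end == n_syllables and word != rhyme_words[line - 1]:
                    continue                      # line-final arc must carry the rhyme word
                nxt = (line, end) if end < n_syllables else (line + 1, 0)
                arcs[(line, syl)].append((word, nxt))
    return arcs

# Toy usage with a 4-word vocabulary and a 1-line "poem".
vocab = {"delight": "01", "attending": "010", "on": "1", "his": "0"}
arcs = build_sonnet_fsa(vocab, rhyme_words=["delight"], n_lines=1)
print(arcs[(1, 0)])   # words allowed at line 1, syllable 0
```

During decoding, the beam search described next only considers words labeling arcs that leave the current FSA state, which is how the formal constraints are enforced.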
100 When decoding with the LM, we employ a beam search that is further guided by the FSA. Each beam state C t;i is a tuple of (h;s;word;score), where h is the hidden states of LSTM at step t in ith state, and s is the FSA state at step t in ith state. The model generates one word at each step. At the beginning,h 0;0 is the initial hidden state of LSTM,s 0;0 is the start state of FSA, word 0;0 = <START> and score 0;0 = 0. To expand a beam state C t;i , we rst feedh t;i andword into the LM and get an updated hidden stateh next . The LM also returns a probability distribution P (V ) over the entire vocabularyV for next word. Then, for each succeeding state s suc of s t;i in the FSA and the word w next over each edge from s t;i to s suc , we form a new state (h next ;s suc ;w next ;score t;i + log(P (w next ))) and push it into next beam. Because we x the rhyme word at the end of each line, when we expand the beam states immediately before the rhyme word, the FSA states in those beam states have only one succeeding state|LN-S10, where N = [1; 14], and only one succeeding word, the xed rhyme word. For our beam size b = 50, the chance is quite low that in those b words there exists any suitable word to precede that rhyme word. We solve this by generating the whole sonnet in reverse, starting from the nal rhyme word. Thus, when we expand the state L1-S8, we can choose from almost every word in vocabulary instead of just b possible words. The price to pay is that at the beginning of each line, we need to hope in thoseb words there exists some that are suitable to succeed comma or period. Because we train on song lyrics, our LM tends to generate repeating words, like never ever ever ever ever. To solve this problem, we apply a penalty to those words that already generated in previous steps during the beam search. To create a poem that ts well with the pre-determined rhyme words at the end of each line, the LM model tends to choose \safe" words that are frequent and 101 suitable for any topic, such as pronouns, adverbs, and articles. During decoding, we apply a reward on all topically related words (generated in Section 4) in the non-rhyming portion of the poem. Finally, to further encourage the system to follow the topic, we train an encoder- decoder sequence-to-sequence model (Sutskever et al., 2014a). For training, we select song lyric rhyme words and assemble them in reverse order (encoder side), and we pair this with the entire reversed lyric (decoder side). At generation time, we put all the selected rhyme words on the source side, and let the model to generate the poem conditioned on those rhyme words. In this way, when the model tries to generate the last line of the poem, it already knows all fourteen rhyme words, thus possessing better knowledge of the requested topic. We refer to generating poems using the RNN LM as the \generation model" and to this model as the \translation model". 6.6 Style Control When poets compose a poem, they usually need to revise and polish the draft from dierent aspects (e.g., word choice, sentiment, alliteration, etc.) for several itera- tions until satisfaction. This is a crucial step for poetry creation. Thus, we add additional weights during decoding to control the style of generated poem, includ- ing the extent of words repetition, alliteration, word length, cursing, sentiment, and concreteness. During the RNN's beam search, each beam cell records the current FSA state s. Its succeeding state is denoted as s suc . 
All the words over all the succeeding states form a vocabulary V_suc. To expand the beam state b, we calculate a score for each word in V_suc:

    score(w, b) = score(b) + \log P_{RNN}(w) + \sum_i w_i f_i(w),   \forall w \in V_suc    (6.1)

where \log P_{RNN}(w) is the log-probability of word w calculated by the RNN, score(b) is the accumulated score of the already-generated words in beam state b, f_i(w) is the i-th feature function, and w_i is the corresponding weight. To control the style, we design the following 8 features:

1. Encourage/discourage words. Users can input words that they would like in the poem, or words to be banned. f(w) = I(w, V_enc/dis), where I(w, V) = 1 if w is in the word list V and 0 otherwise. w_enc = 5 and w_dis = -5.

2. Curse words. We pre-build a curse-word list V_curse, and f(w) = I(w, V_curse).

3. Repetition. Controls the extent of repeated words in the poem. For each beam, we record the already-generated words V_history, and f(w) = I(w, V_history).

4. Alliteration. Controls how often adjacent non-function words start with the same consonant sound. In the beam cell, we also record the previously generated word w_{t-1}, and f(w_t) = 1 if w_t and w_{t-1} share the same first consonant sound, otherwise 0.

5. Word length. Controls a preference for longer words in the generated poem. f(w) = length(w)^2.

6. Topical words. For each user-supplied topic word, we generate a list of related words V_topical. f(w) = I(w, V_topical).

7. Sentiment. We pre-build a word list with sentiment scores based on SentiWordNet (Baccianella et al., 2010). f(w) equals w's sentiment score.

8. Concrete words. We pre-build a word list with scores reflecting concreteness, based on Brysbaert et al. (2014). f(w) equals w's concreteness score.

6.7 Speedup

Generating a poem may require a heavy search procedure. Slow speed is a serious bottleneck for a smooth user experience, and it prevents the large-scale collection of feedback for system tuning. Thus, we increase speed by pre-calculation, pre-loading model parameters, and pruning the vocabulary. We also parallelize the computation of FSA expansion, weight merging, and beam search, and we port them to GPU. Overall, we can generate a four-line poem within 2 seconds, ten times faster than our previous CPU-based system.

To find the rhyming words related to the topic, we employ a word2vec model. Given a topic word or phrase w_t \in V, we find related words w_r based on cosine similarity:

    w_r = \arg\max_{w_r \in V' \subset V} cosine(e_{w_r}, e_{w_t})    (6.2)

where e_w is the embedding of word w. Then we calculate the rhyme type of each related word w_r to find rhyme pairs. To speed up this step, we carefully optimize the computation with these methods:

1. Pre-load all parameters into RAM. As we aim to accept arbitrary topics, the vocabulary V of the word2vec model is very large (1.8M words and phrases). Pre-loading saves 3-4 seconds.

2. Pre-calculate the rhyme types for all words w \in V'. At runtime, we use this dictionary to look up the rhyme type.

3. Shrink V'. As every rhyme word/phrase pair must be in the target vocabulary V_RNN of the RNN, we further shrink V' = V \cap V_RNN.

To speed up the RNN decoding step, we use GPU processing for all forward-propagation computations. For beam search, we port to GPU the two most time-consuming parts, calculating scores with Equation 6.1 and finding the top words based on the score:

1. We wrap all the computation needed in Equation 6.1 into a single large GPU kernel launch.

2.
With beam size B, to nd the top k words, instead of using a heap sort on CPU with complexityO(BjV suc jlogk), we do a global sort on GPU with complexityO(BjV suc jlog(BjV suc j)) in one kernel launch. Even though the complexity increases, the computation time in practice reduces quite a bit. Finally, our system can generate a 4-line poem within 2 seconds, which is 10 times faster than the previous CPU-based version. 105 Bipolar Disorder Existence enters your entire nation. A twisted mind reveals becoming manic, An endless modern ending medication, Another rotten soul becomes dynamic. Or under pressure on genetic tests. Surrounded by controlling my depression, And only human torture never rests, Or maybe you expect an easy lesson. Or something from the cancer heart disease, And I consider you a friend of mine. Without a little sign of judgment please, Deliver me across the borderline. An altered state of manic episodes, A journey through the long and winding roads. Figure 6.3: Sample sonnet generated from the topic phrase bipolar disorder. 6.8 Experiments 6.8.1 Quality analysis Sample outputs produced by our best system are shown in Figures 6.3 and 6.4. We nd that they generally stay on topic and are fairly creative. If we request a poem on the topic Vietnam, we may see the phrase Honky Tonkin Resolution; a dierent topic leads the system to rhyme Dirty Harry with Bloody Mary. 6.8.2 Human-Computer Collaboration In this experiment, we design an Amazon Mechanical Turk task to explore the eect of style options. Turkers will rst use our system to generate a default poem on an arbitrary topic with the default style conguration, and rate it. Next, they are asked to adjust the style congurations to re-generate at least ve dierent adjusted poems with the same rhyme words, and rate them as well. Improving the 106 Love at First Sight An early morning on a rainy night, Relax and make the other people happy, Or maybe get a little out of sight, And wander down the streets of Cincinnati. Girlfriend Another party started getting heavy. And never had a little bit of Bobby, Or something going by the name of Eddie, And got a nger on the trigger sloppy. Noodles The people wanna drink spaghetti alla, And maybe eat a lot of other crackers, Or sit around and talk about the salsa, A little bit of nothing really matters. Civil War Creating new entire revolution, An endless nation on eternal war, United as a peaceful resolution, Or not exist together any more. Figure 6.4: Sample stanzas generated from dierent topic phrases. quality of adjusted poems over the default poem is not required for nishing the task, but it is encouraged. This experiment tests whether human collaboration can help Hafez generate better poems. In only 10% of the HITs, the reported best poem was generated by the default style options, i.e., the default poem. Additionally, in 71% of the HITs, users assign a higher star rating to at least one of the adjust poems than the default poem. On average the best poems got +1:4 more stars compared to the default one. However, poem creators might have a tendency to report a higher ranking for poems generated through the human/machine collaboration process. To sanity check the results we designed another task and asked 18 users to compare the 107 default and the reported best poems. This experiment seconded the original rank- ings in 72% of the cases. 6.9 Conclusion We have described Hafez, a poetry generation system that combines hard format constraints with a deep-learning recurrent network. 
It enables users to generate poems about any topic, and revise generated texts through multiple style cong- urations. 108 Chapter 7 Augmentation: Neural Machine Translation for Low-resource Languages In the previous three chapters, we mainly focused on the decoding phase of the neural sequence model: speed up beam search by shrinking the runtime vocabulary, and apply constraints with Finite State Acceptors. In this chapter, we start to touch the training phase of the neural sequence model, and specically, improve the neural machine translation in the scenario where there is limited training data. 7.1 Introduction Neural machine translation (NMT) has been demonstrated as an eective approach both in academia and industry (Wu et al., 2016; He, 2015). However, one of the biggest challenges to achieve good translation performance is to train on a large size of parallel corpora, which is usually hard to get, especially for the languages with a small population. Table 7.1 shows the BLEU scores of a neural machine translation system and a non-neural syntax-based machine translation system (Galley et al., 2004, 2006), both trained from Uyghur to English. The training corpus contains only about 6.48 million tokens, whereas an adequate training corpus usually contains more than 100 109 Train size NMT Non-NMT Uyg2Eng 6.48 M 10.6 21.6 (Syntax-based) Ger2Eng 226 M 20.9 20.7 (Phrase-based) Table 7.1: The BLEU score of neural MT and syntax-based MT system from Uyghur to English and German to English (Luong et al., 2015) million tokens. The neural machine translation system has lost its power on such small training data, achieving only 10.6 BLEU score, whereas the syntax-based translation system achieves 21.6. Koehn and Knowles (2017) conduct experiment on Spanish to English using corpus that ranges from 0.4 million tokens to 385.7 million tokens. The experiments show that the BLEU score of NMT is much lower than phrase-based machine translation (PBMT) with small corpus, and matches up with PBMT when trained on about 15 million tokens. There are multiple reasons that could explain the decit of NMT when trained on small corpus: Out-of-vocabulary (OOV) problem. The vocabulary on source side and target side can only derive from the training data. The smaller the training data, the larger chance that certain words from the test set are not in the vocabulary. The problem becomes more serious for the morphologically rich language, as it contains more in ections. Overtting problem. The neural machine translation models are known for its superior expression power, due to the huge amount of internal param- eters. Thus it's more easy to overt to smaller dataset, leading to bad trans- lation quality. Rare word selection problem. Besides the out-of-vocabulary words, there are still a big portion of words that appears very few times. During the train- ing procedure, the corresponding parameters of these rare words are merely 110 updated since its initialization. It's a general problem for NMT, but when dealing with low resource languages and morphologically rich languages, the problem deteriorates. Inadequate translations. Experiments show that NMT tends to generate shorter translations when trained on smaller corpus. It will drop certain concepts of the source sentences thus lead to inadequate translations. Lots of research ideas have been proposed to tackle each of the above-mentioned challenges. 
In this work, we re-examine these techniques under the low resource scenario, explore their variations and stack them together to build a stronger neural machine translation system. The recipe of our nal system includes the following ingredients: To solve the OOV problem, we utilize the sub-words translation via byte pair encoding (BPE). In contrast to the usual practice, which applies BPE to both source and target side vocabulary, we nd it more ecient and eective to apply BPE only to the morphologically rich language. To relieve the overtting problem, we tie the input word embedding and output word embedding on the target side. We explore multiple variations of dropout technicals and surprisingly nd that dropout on both non-recurrent connections and recurrent connections of the multi-layer recurrent neural network (RNN) can boost the BLEU score by a large margin. We also utilize the layer normalization (Ba et al., 2016) which shows faster and better convergence. To better handle rare words, we follow Nguyen and Chiang (2017)'s idea to normalize the hidden states of right before the inner product with output 111 word embedding matrix. We also explore multiple strategies of unknown words replacement, the improvement is tiny due to the general low quality translation and attention. To improve the adequacy of translations, we rescale the log probability during the beam search with length normalization and coverage penalty (Wu et al., 2016). We conduct our experiments on a small Uygher to English corpus, and achieve 10.4 BLEU score improvement over the vanilla NMT system, which is comparable to the performance of syntax-based system. 7.2 Sub-word Translation Dierent from phrase-based or syntax-based machine translation systems, neural machine translation system requires to list the source and target vocabulary before- hand. Word-level NMT usually selects the top frequent 30,000 to 50,000 words as vocabulary set and map all the other words to a special symbol `UNK'. For mor- phologically rich languages, even for a small corpus, the number of distinct word types is much larger than 50,000. Thus there will be many word tokens mapped to `UNK' in the training data. One solution to solve this problem is to use sub-word units as the vocabulary. Sennrich et al. (2016) propose to split the words into smaller pieces via byte pair encoding (BPE). Give a tokenized corpus, BPE will rst split each word into characters and then merge recursively the most frequent adjacent pairs into a new word piece. The vocabulary size depends on the number of merge operations. Figure 7.1 shows the ratio of non-UNK tokens of Uyghur and English train- ing/development set. As the vocabulary size reaches 40,000, 97.5% tokens in 112 10K 20K 30K 40K BPE20K Vocabulary Size 0.0 0.2 0.4 0.6 0.8 1.0 Ratio of Non-UNK Tokens Uyg Train Uyg Dev Eng Train Eng Dev Figure 7.1: The ratio of non-UNK tokens of Uyghur/English train/dev set when selecting dierent number of top frequent words as vocabulary. English training set are non-UNK tokens, whereas only 88% tokens in Uyghur side are non-UNK tokens. If we apply 20,000 BPE merge operations on both side, all the tokens on both English and Uyghur side will be within our vocabulary. The usual practice is to apply BPE on both source side and target side. How- ever, experiments show that this will lead to an even worse performance. The major reason is that because of the small corpus size, the split on the English side may not be reasonable. 
As the top 40,000 English types already covers about 97.5% training tokens, using word-level vocabulary is a safe and ecient choice. Thus, we only apply BPE on the Uyghur side. 7.3 Regularization for Training 7.3.1 Dropout on RNN Dropout is an ecient way to prevent overtting in neural networks. Multiple dropout techniques have been proposed for RNN. As illustrated in Figure 7.2, the 113 Figure 7.2: Depiction of three dierent dropout techniques on RNN. Each block represents a RNN unit. Vertical arrows represent feed-forward connections between layers, and Horizontal arrows represent recurrent connections between steps. The dashed arrow means a dropout mask is applied on the connection, and solid arrow means there's no dropout. Dierent color means dierent sampled dropout mask. major dierence is where the dropout happens and how the dropout masks are sampled at dierent steps. The naive dropout for RNN used in Sutskever et al. (2014b) will only dropout the feed forward connections between layers. Each dropout mask is independently sampled at dierent steps. This dropout technique has been used in most sequence to sequence NMT system. Gal and Ghahramani (2016) propose variational dropout where both the feed- forward connections and recurrent connections between steps are dropped-out. Dierent from naive dropout, the dropout masks of variational dropout at dierent steps within the same mini-batch are shared. Gal and Ghahramani (2016) has shown its superior performance on language model and sentiment analysis. If we relax the constrain of variational dropout and let the dropout masks at dierent steps get sampled independently, we get a new dropout strategy which we referred as recurrent dropout. Semeniuta et al. (2016)'s experiments have shown that recurrent dropout can outperform variational dropout on word-level language 114 modeling task. It's important to notice that if long short term memory cell (Hochreiter and Schmidhuber, 1997b) are used in RNN, the dropout on recurrent connections only happens on the hidden states, and not on the cell states. We experiment all these three dropout techniques on neural machine translation systems. Surprisingly, we nd the recurrent dropout provide a large boost on the BLEU score in the low resource scenario, whereas variational dropout even hurts the performance. Preliminary results on high resource translation system indicates that both recurrent dropout and variation dropout will hurt the training at the very early stage. 7.3.2 Layer Normalization In neural networks, the gradients with respect to the weights in one layer is highly depend on the output of previous layer, which usually changes in a highly correlated way, thus slows down the convergence speed. Layer normalization (Ba et al., 2016) reduces this problem by xing the mean and the variance of the summed input of each layer as follows: a t =W hh h t1 +W xh x t (7.1) h t =f[ g t (a t t ) +b] (7.2) t = 1 H H X i=1 a i t (7.3) t = v u u t 1 H H X i=1 (a i t t ) 2 (7.4) wherea t is the summed input before the non-activation functionf at stept. g and b are gain and bias parameters that need to learn during the training. Experiments 115 on low resource NMT shows that layer normalization not only accelerates the convergence but also signicantly boosts the performance. Beside the recurrent dropout and layer normalization, we also tie the parame- ters of input and output word embedding matrices on the target side, which also benets the nal BLEU score. 
7.4 Rare Word Selection

At each step during beam search, the probability of each word in the target vocabulary is calculated as follows:

    P(w_i | h_t) = \frac{\exp(e_i \cdot h_t + b_i)}{\sum_j \exp(e_j \cdot h_t + b_j)}    (7.5)

where e_i is word i's embedding and h_t is the hidden state at step t. Nguyen and Chiang (2017) argue that the inner product e_i \cdot h_t favors frequent words disproportionately, and suggest fixing the norms of both e_i and h_t. In this work, we only fix the norm of h_t, re-parameterizing it as r * h_t / ||h_t||, where r is a hyper-parameter that needs to be tuned.

For each `UNK' symbol generated during beam search, we locate the source word with the highest attention score and look up the corresponding target word in a dictionary. If the source word is not in the dictionary, we simply copy the source word into the translation.

7.5 Inadequate Translation

NMT tends to generate shorter translations and may drop certain concepts of the source sentence. We follow Wu et al. (2016) and re-rank the candidate translations as follows:

    s(Y, X) = \log(P(Y|X)) / lp(Y) + cp(X, Y)

    lp(Y) = \frac{(5 + |Y|)^\alpha}{(5 + 1)^\alpha}

    cp(X, Y) = \beta \sum_{i=1}^{|X|} \log( \min( \sum_{j=1}^{|Y|} p_{i,j}, 1.0 ) )

where s(Y, X) is the score function of source X and candidate target Y, and p_{i,j} is the attention score of target word j on source word i. \alpha is the length normalization factor and \beta is the coverage penalty factor. We conduct a grid search over \alpha \in [0, 1] and \beta \in [0, 1] to find the best settings.

7.6 Experiment

7.6.1 Data set

All our experiments are conducted on a Uyghur-English parallel corpus. Table 7.2 shows the statistics of this corpus. Uyghur is a morphologically rich, left-branching language with subject-object-verb word order.

      | # Sentence Pairs | # Tokens
Train | 638,724          | 6,483,786
Dev   | 686              | 8,551
Test  | 347              | 4,341

Table 7.2: Statistics of the Uyghur-English corpus.

7.6.2 The Baseline Systems

Our baseline neural machine translation system is a standard two-layer encoder-decoder model (Sutskever et al., 2014b) with long short-term memory units. We use a minibatch of 128, a hidden state size of 300, and dropout applied only to non-recurrent connections with a rate of 0.2. We train the model using Adagrad (Duchi et al., 2011) with learning rate 0.5. The model is trained for at most 80 epochs, and stops early if the perplexity on a held-out development set does not decrease for 5 consecutive epochs. Gradients are re-scaled when the global norm is larger than 5. Both the source and target vocabulary sets are built by selecting the top 40,000 frequent words in each language, with the remaining words mapped to a special symbol `UNK'. The maximum training sentence length is set to 100.

7.6.3 Evaluation

To evaluate translation quality, we report the detokenized, case-sensitive NIST BLEU score (Papineni et al., 2002b) against a single reference.

7.6.4 Results and Analysis

Table 7.3 shows the test BLEU scores of different translation systems trained on the Uyghur-English corpus. Our implementation of the baseline NMT achieves a BLEU score similar to that of Zoph_RNN (https://github.com/isi-nlp/Zoph_RNN). Compared to the baseline NMT system, our final system boosts the BLEU score by about 10.4 points, achieving performance comparable to the syntax-based MT.
Another line of research for low-resource translation is transfer learning (Zoph et al., 2016b), in which the model is first trained on a large external parallel corpus and the learned parameters are then partially reused as the starting point for retraining on the small dataset. Compared to transfer learning, our final system outperforms it by 7.1 BLEU points without using any external data.

System                                                                  | BLEU | Brevity Penalty
Syntax-Based MT                                                         | 21.6 |
Baseline NMT (Zoph_RNN)                                                 | 10.3 |
Transfer Learning                                                       | 13.9 |
Baseline NMT                                                            | 10.6 | 0.673
+ Byte pair encoding on Uyghur (20K merge operations)                   | 14.0 | 0.846
+ Tie the input and output embedding                                    | 14.8 | 0.879
+ Recurrent dropout (dropout rate = 0.2)                                | 17.1 | 0.878
+ Normalize h_t (r = 3.5)                                               | 17.8 | 0.897
+ Layer normalization                                                   | 20.4 | 0.924
+ Length normalization (alpha = 0.1) and coverage penalty (beta = 0.5)  | 20.9 | 0.984
+ Unknown words replacement                                             | 21.0 | 0.969

Table 7.3: BLEU scores on the test set for different Uyghur-English translation systems.

The most significant BLEU gains come from three techniques: BPE on the Uyghur side (+3.4), recurrent dropout (+2.3) and layer normalization (+2.6). This indicates that for low-resource translation, the biggest challenges are the OOV problem and the neural network regularization problem. The brevity penalty reflects the length of the generated translations: as more techniques are applied, the system generates longer and longer translations. However, the brevity penalty is still close to 0.9, which leaves plenty of room for length normalization and coverage penalty to re-rank the candidate translations.

7.7 Conclusion

Low-resource neural machine translation faces multiple challenges, including the OOV problem, the overfitting problem, the rare word selection problem and the inadequate translation problem. We re-examine multiple techniques used in high-resource NMT systems and other NLP tasks, and combine variations of these techniques into a stronger NMT system for low-resource languages. Our final system achieves a 10.4 BLEU improvement over the vanilla NMT system on the Uyghur-English dataset, and achieves results comparable to the syntax-based machine translation system.

Chapter 8  Conclusion

In this dissertation, I aim to interpret the neural sequence model and augment it in both the training and decoding phases.

First, I utilize neural machine translation (NMT) as a testbed to understand the internal mechanisms of recurrent neural networks (RNNs). I investigate how NMT outputs target strings of appropriate lengths, locating a collection of hidden units that learns to explicitly implement this functionality. I then investigate whether NMT systems learn source-language syntax as a by-product of training on string pairs, and find that both local and global syntactic information is grasped by the encoder, with different layers storing different types of syntax at different degrees of concentration.

Next, I design two novel GPU-based algorithms to speed up decoding: 1) utilizing word alignment information to shrink the target-side run-time vocabulary; 2) applying locality sensitive hashing to find the nearest word embeddings. Both methods lead to 2-3x speedups on four translation tasks without hurting the bilingual evaluation understudy (BLEU) score.

Third, I integrate a Finite State Acceptor into RNN beam search to provide a flexible way to constrain the output. The integration is further applied to a poem generation system to control meter and rhyme patterns.
Fourth, I re-examine multiple technologies that are used in high resource lan- guage NMT and other NLP tasks, explore their variations and stack them together to build a stronger neural machine translation system for low resource languages. 121 Experiments on Uygher-English show 10.4 BLEU score improvement over the vanilla NMT system. 122 Reference List Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classier probes. arXiv preprint arXiv:1610.01644, 2016. Jimmy Lei Ba, Jamie Ryan Kiros, and Georey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, 2010. Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of nite state markov chains. The annals of mathematical statistics, 37(6):1554{ 1563, 1966. Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Matous Machacek, Christof Monz, Pavel Pecina, Matt Post, Herv e Saint- Amand, Radu Soricut, and Lucia Specia, editors. Proc. Ninth Workshop on Statistical Machine Translation. 2014. P. Brown, S. della Pietra, V. della Pietra, and R. Mercer. The mathematics of sta- tistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263{311, 1993. P. F. Brown, J. Cocke, S. della Pietra, V. della Pietra, F. Jelinek, J. C. Lai, and R. L. Mercer. Method and system for natural language translation, 1995. US Patent 5,477,451. Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness rat- ings for 40 thousand generally known English word lemmas. Behavior research methods, 2014. M. A. Casta~ no and F. Casacuberta. A connectionist approach to machine trans- lation. In EUROSPEECH, 1997. Eugene Charniak and Mark Johnson. Coarse-to-ne n-best parsing and MaxEnt discriminative reranking. In Proc. ACL, 2005. 123 Welin Chen, David Grangier, and Michael Auli. Strategies for training large vocab- ulary neural language models. In Proc. ACL, 2016. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase represen- tations using RNN encoder-decoder for statistical machine translation. In Proc. EMNLP, 2014. Linguistic Data Consortium. (bolt lrl uzbek representative language pack v1.0. ldc2016e29, 2016. Brooke Cowan, Ivona Ku cerov a, and Michael Collins. A discriminative model for tree-to-tree translation. In Proc. EMNLP, 2006. Thomas Dean, Mark A Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijaya- narasimhan, and Jay Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In Proc. CVPR, 2013. Steve DeNeefe, Kevin Knight, Wei Wang, and Daniel Marcu. What can syntax- based MT learn from phrase-based MT? In Proc. EMNLP-CoNLL, 2007. Belen Diaz-Agudo, Pablo Gervas, and Pedro Gonzalez-Calero. Poetry generation in COLIBRI. In Proc. ECCBR. 2002. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121{2159, 2011. Dzmitry Bahdana, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neu- ral machine translation by jointly learning to align and translate. In Proc. ICLR, 2014. Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Proc. NIPS, 2016. Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 
Abstract
Recurrent neural networks (RNNs) have been successfully applied to various natural language processing tasks, including language modeling, machine translation, and text generation. However, several obstacles still stand in the way. First, because an RNN stores information in distributed representations, its internal mechanism is difficult to interpret, and the model remains a black box. Second, because of the large vocabularies involved, text generation is time-consuming. Third, there is no flexible way to constrain the generation of a sequence model with external knowledge. Last, large amounts of training data must be collected to guarantee the performance of these neural models, whereas annotated data, such as the parallel text used in machine translation, are expensive to obtain. This work addresses these four challenges.

To better understand the internal mechanism of the RNN, we choose neural machine translation (NMT) systems as a testbed. We first investigate how NMT produces target strings of appropriate lengths, locating a collection of hidden units that learns to explicitly implement this functionality. We then investigate whether NMT systems learn source-language syntax as a by-product of training on string pairs, and find that the encoder captures both local and global syntactic information about source sentences, with different types of syntax stored in different layers at different degrees of concentration.

To speed up text generation, we propose two novel GPU-based algorithms: 1) utilize source/target word alignment information to shrink the target-side run-time vocabulary …
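To make the alignment-based idea concrete, the Python sketch below shows one plausible form of run-time vocabulary shrinking. It is a minimal illustration under assumptions, not the exact method developed in this dissertation: a word-alignment table, built offline by an external aligner, proposes the top-k candidate translations of each source word, and these candidates, together with a short list of frequent target words, form the only vocabulary over which the decoder's softmax is computed. The names build_align_table, runtime_vocab, common_words, and top_k are hypothetical.

from collections import defaultdict

def build_align_table(bitext, alignments):
    # Count aligned (source word, target word) pairs.
    # bitext:     list of (src_tokens, tgt_tokens) sentence pairs
    # alignments: list of link lists; each link (i, j) aligns src_tokens[i]
    #             to tgt_tokens[j], as produced by an external word aligner.
    table = defaultdict(lambda: defaultdict(int))
    for (src, tgt), links in zip(bitext, alignments):
        for i, j in links:
            table[src[i]][tgt[j]] += 1
    return table

def runtime_vocab(src_tokens, align_table, common_words, top_k=10):
    # Shrunken target-side vocabulary for one source sentence: the top-k
    # aligned candidates of every source word, plus a fixed set of frequent
    # target words.  The decoder's softmax is then computed only over this
    # subset instead of the full vocabulary.
    vocab = set(common_words)
    for w in src_tokens:
        cands = align_table.get(w, {})
        vocab.update(sorted(cands, key=cands.get, reverse=True)[:top_k])
    return vocab

# Hypothetical usage:
# table = build_align_table(bitext, alignments)
# small_vocab = runtime_vocab(src_sentence.split(), table,
#                             common_words={"the", "a", "of", "is"})

Restricting the softmax to a few hundred candidates in this way removes most of the per-step output-layer computation, which is where decoding with a large vocabulary spends the bulk of its time.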
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Non-traditional resources and improved tools for low-resource machine translation
Theory of memory-enhanced neural systems and image-assisted neural machine translation
Syntactic alignment models for large-scale statistical machine translation
Neural creative language generation
Smaller, faster and accurate models for statistical machine translation
Deep learning models for temporal data in health care
Scalable machine learning algorithms for item recommendation
Hashcode representations of natural language for relation extraction
The inevitable problem of rare phenomena learning in machine translation
Modeling, learning, and leveraging similarity
Beyond parallel data: decipherment for better quality machine translation
Weighted tree automata and transducers for syntactic natural language processing
Designing neural networks from the perspective of spatial reasoning
Improved word alignments for statistical machine translation
Neural networks for narrative continuation
Learning distributed representations from network data and human navigation
Efficient pipelines for vision-based context sensing
Exploiting comparable corpora
Generating psycholinguistic norms and applications
Deciphering natural language
Asset Metadata
Creator: Shi, Xing (author)
Core Title: Neural sequence models: Interpretation and augmentation
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 08/01/2018
Defense Date: 05/08/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: GPU, interpretation, language generation, locality sensitive hashing, neural machine translation, neural networks, OAI-PMH Harvest, poem generation, sequence-to-sequence models, speedup, word alignment
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Knight, Kevin (committee chair), May, Jonathan (committee member), Narayanan, Shri (committee member)
Creator Email: shixing19910105@gmail.com, xingshi@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-45523
Unique identifier: UC11668777
Identifier: etd-ShiXing-6594.pdf (filename), usctheses-c89-45523 (legacy record id)
Legacy Identifier: etd-ShiXing-6594.pdf
Dmrecord: 45523
Document Type: Dissertation
Rights: Shi, Xing
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA