THEORY OF MEMORY-ENHANCED NEURAL SYSTEMS AND
IMAGE-ASSISTED NEURAL MACHINE TRANSLATION
by
Yuanhang Su
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2019
Copyright 2019 Yuanhang Su
Contents
List of Tables iv
List of Figures v
Abstract vii
1 Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 RNN’s Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Dimension Reduction for Sequence . . . . . . . . . . . . . . . 5
1.2.3 Unsupervised Multi-modal Machine Translation (UMNMT) . . 6
1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . 10
2 Background Review 11
2.1 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 RNN Cell Models . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2 RNN Macro Models . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Dimension Reduction for Sequence . . . . . . . . . . . . . . . . . . . 26
2.2.1 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . 26
2.2.2 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Unsupervised Multi-modal Neural Machine Translation . . . . . . . . . 30
2.3.1 Neural Multi-modal Systems . . . . . . . . . . . . . . . . . . . 30
2.3.2 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Sequence Analysis via Recurrent Neural Networks 40
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Memory Analysis of SRN, LSTM and GRU . . . . . . . . . . . . . . . 41
3.2.1 Memory of SRN . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Memory of LSTM . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.3 Memory of GRU . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Extended Long Short-Term Memory (ELSTM) . . . . . . . . . . . . . 47
3.3.1 Study of Scaling Factor . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Dependent BRNN (DBRNN) Model . . . . . . . . . . . . . . . . . . . 51
3.4.1 BRNN and Encoder-Decoder . . . . . . . . . . . . . . . . . . 52
3.4.2 DBRNN Model and Training . . . . . . . . . . . . . . . . . . . 55
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.2 Comparison of RNN Models . . . . . . . . . . . . . . . . . . . 61
3.5.3 Comparison between ELSTM and Non-RNN-based Methods . . 64
3.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . 66
4 Sequence Analysis via Dimension Reduction Techniques 68
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Tree-structured Multi-stage PCA (TMPCA) . . . . . . . . . . . . . . . 69
4.2.1 Training of TMPCA . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.2 Computational Complexity . . . . . . . . . . . . . . . . . . . . 72
4.2.3 System Function . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.4 Information Preservation Property . . . . . . . . . . . . . . . . 76
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5 Image-assisted Neural Machine Translation 91
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Unsupervised Multi-modal Neural Machine Translation . . . . . . . . . 94
5.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6 Conclusion and Future Work 110
6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Appendices 116
A Chapter 3 : More Results . . . . . . . . . . . . . . . . . . . . . . . . . 116
B Chapter 4 : Derivation of System Function . . . . . . . . . . . . . . . . 117
C Chapter 5 : More Visualization Results . . . . . . . . . . . . . . . . . . 118
Bibliography 126
List of Tables
2.1 Activation of Neurons. . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Comparison of Parameter Numbers. . . . . . . . . . . . . . . . . . . . 48
3.2 Network parameters for the toy experiment. . . . . . . . . . . . . . . . 50
3.3 Training dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Network parameters and training details. . . . . . . . . . . . . . . . . . 60
3.5 LM test perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 DP test results (UAS/LAS %) . . . . . . . . . . . . . . . . . . . . . . 61
3.7 POS tagging test accuracy (%) . . . . . . . . . . . . . . . . . . . . . . 61
3.8 DP test accuracy (%) and system settings . . . . . . . . . . . . . . . . 66
4.1 Dimension evolution from the input to the output in the TMPCA method. 72
4.2 Subscripts of $U^s_j$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Selected text classification datasets. . . . . . . . . . . . . . . . . . . . 79
4.4 Parameters in dense network training. . . . . . . . . . . . . . . . . . . 83
4.5 Testing accuracy (%) of different TC models. . . . . . . . . . . . . . . 84
4.6 Comparison of training time for different models. . . . . . . . . . . . . 84
4.7 Comparison of model parameter numbers in different models. . . . . . 85
4.8 Comparison of training time in seconds (TMPCA/PCA). . . . . . . . . 86
4.9 The relative mutual information ratio (TMPCA versus Mean). . . . . . 89
4.10 The relative mutual information ratio (PCA versus TMPCA). . . . . . . 89
5.1 BLEU benchmarking. The numbers of baseline models are extracted
from the corresponding references. . . . . . . . . . . . . . . . . . . . . 103
5.2 UMNMT shows consistent improvement over text-only model across
normalized Meteor, Rouge and CIDEr metrics. . . . . . . . . . . . . . 103
5.3 BLEU for testing with TEXT ONLY input . . . . . . . . . . . . . . . 107
5.4 BLEU INCREASE ( ) UMNMT model trained on full Multi30k over
UMNMT model trained on M30k-half (Table 5.1 Row 5-8). . . . . . . . 108
6.1 Error Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Ranking of Distance: Ascending Order . . . . . . . . . . . . . . . . . . 114
6.3 Ranking of Distance: Ascending Order . . . . . . . . . . . . . . . . . . 115
B.1 Subscripts of $U^s_j$ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
List of Figures
1.1 NN based word embedding. . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 RNN that produces AAAB. . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Clustering capability of SRN’s hidden state. . . . . . . . . . . . . . . . 14
2.3 SRN for character prediction. . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Back-propagation through time. . . . . . . . . . . . . . . . . . . . . . 15
2.5 The diagram of original LSTM cell. . . . . . . . . . . . . . . . . . . . 16
2.6 Hidden state of LSTM overflow problem. . . . . . . . . . . . . . . . . 17
2.7 Resetting of hidden state by introducing forget gate in LSTM. . . . . . 18
2.8 The diagram of a LSTM cell with forget gate. . . . . . . . . . . . . . . 19
2.9 The diagram of a GRU cell. . . . . . . . . . . . . . . . . . . . . . . . . 20
2.10 The diagram of BRNN. . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.11 The diagram of sequence to sequence (seq2seq). . . . . . . . . . . . . . 23
2.12 Example of Deep RNN. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.13 Deep RNN Proposals. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.14 Illustration of the fastText model. . . . . . . . . . . . . . . . . . . . . . 29
2.15 Multi-modal architecture for image captioning. . . . . . . . . . . . . . 30
2.16 Machine translation model. . . . . . . . . . . . . . . . . . . . . . . . . 31
2.17 Encoder-decoder based MT model. . . . . . . . . . . . . . . . . . . . . 31
2.18 Image captioning with attention. . . . . . . . . . . . . . . . . . . . . . 32
2.19 Multi-task hybrid encoder-decoder for machine translation. . . . . . . . 33
2.20 Scale dot-product attention diagram. . . . . . . . . . . . . . . . . . . . 34
2.21 Multi-head attention diagram. . . . . . . . . . . . . . . . . . . . . . . 35
2.22 Transformer model diagram. . . . . . . . . . . . . . . . . . . . . . . . 37
2.23 Masked scaled dot-product attention diagram. . . . . . . . . . . . . . . 38
3.1 The diagrams of the ELSTM cell. . . . . . . . . . . . . . . . . . . . . 47
3.2 Experiment of estimating the presence of “A”. . . . . . . . . . . . . . . 49
3.3 Comparison of memory response between LSTM and ELSTM. . . . . . 51
3.4 The DBRNN model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 The training perplexity vs. training steps of different cells. . . . . . . . 62
3.6 The training perplexity vs. training steps of different macro models. . . 62
3.7 The training perplexity vs. training steps of different models with $I_t = X_t$
and $I_t^T = [X_t^T, h_{t-1}^T]$ for the DP task. . . . . . . . . . . . . . . . 65
4.1 The Block diagram of the TMPCA method. . . . . . . . . . . . . . . . 70
4.2 Comparison of testing accuracy (%) of fastText (dotted blue), TMPCA+Dense
(red solid), and PCA+Dense (green head dotted), where the horizontal
axis is the input length $N$. . . . . . . . . . . . . . . . . . . . . . . . 87
4.3 The energy of TMPCA (red solid) and PCA (green head dotted) coef-
ficients is expressed as percentages of the energy of input sequences of
length $N = 32$, where the horizontal axis indicates the TMPCA stage
number while PCA has only one stage. . . . . . . . . . . . . . . . . . . 88
5.1 Illustration of our proposed approach. We leverage the designed loss
function to tackle a supervised task with the unsupervised dataset only.
SCE means sequential cross-entropy. . . . . . . . . . . . . . . . . . . . 92
5.2 Model overview. Left Panel: The detailed unsupervised multi-modal
neural machine translation model includes five modules, two transformer
encoder, two transformer decoder and one ResNet encoder. Some detailed
network structures within the transformer, like skip-connection and layer
normalization, are omitted for clarity. Right Panel: The entire frame-
work consists of four training paths: the gray arrows in the paths for
cycle-consistency loss indicate the model is under inference mode. E.g.,
the time step decoding for token “hat” is illustrated. . . . . . . . . . . . 96
5.3 Validation BLEU comparison between text-only and text+image. . . . . 104
5.4 Translation results from different models (GT: ground truth) . . . . . . 105
5.5 Correct attention for {“humme”, “chapeau”, “orange”, “chose”} and
{“bleu”, “t-shirt”, “blanc”, “short”}. . . . . . . . . . . . . . . . . . . . 106
5.6 Correct attention for {“chien”, “brun”, “accède” and “surface”}, but
missed “twig” for “étang”. . . . . . . . . . . . . . . . . . . . . . . . . 107
6.1 Image captioning errors . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.2 Generated: a bunch of umbrellas that are in the grass Ground Truth:
Parasols of different colors hanging from tall trees . . . . . . . . . . . 114
6.3 Generated: a man and a woman sitting on a bench Ground Truth:
Man riding double bicycle stopped on road reading cell phone . . . . . 115
A.1 The training perplexity vs. training steps of different models with $I_t = X_t$
and $I_t^T = [X_t^T, h_{t-1}^T]$ for the POS tagging task. . . . . . . . . . 116
Abstract
In this thesis, we aim to improve a sequence (language, speech, video, etc.) learning
system's capabilities by enhancing its memory and enabling its visual imagination.
For the former, we focus on understanding the memory mechanism for sequence
learning problems via recurrent neural networks (RNN) and dimension reduction (DR).
We put forward two viewpoints. The first viewpoint is that memory is a function that
maps certain elements of the input sequence to the current output. Such a definition,
for the first time in the literature, allows us to conduct a detailed investigation of the memory of
three basic RNN cell models, namely the simple RNN (SRN), the long short-term memory
(LSTM), and the gated recurrent unit (GRU). We found that all three RNN cell models
suffer from memory decay. To overcome this limitation by design, we propose a new basic
RNN model called extended LSTM (ELSTM) with trainable scaling factors for better
memory attention. The ELSTM achieves outstanding performance for various complex
language tasks. The second viewpoint is that memory is a compact representation of
sparse sequential data. From this perspective, dimension reduction (DR) methods like
principal component analysis (PCA) become attractive. However, there are two known
problems in implementing PCA for sequence learning problems: the first is computa-
tional complexity; the second is input positional relationship preservation. To deal with
these problems, an efficient dimension reduction algorithm called tree-structured multi-
stage PCA (TMPCA) is proposed. Similar to PCA, TMPCA is a linear system with
orthonormal projection. It requires no labeled training data and maintains high mutual
information between its input and output. Compared to the state-of-the-art neural net-
work (NN) based embedding techniques, TMPCA achieves better or commensurate per-
formance for text classification. It typically takes less than one second to train TMPCA
on a large data corpus with millions of samples.
For the latter, we are interested in exploring deep-learning solutions that utilize
both language and visual information (also known as multi-modal systems)
in an unsupervised learning environment for machine translation (MT). We find
that images can help associate semantics across different languages and thus reduce the
degree of uncertainty of the task. By passing an image's residual network (ResNet) features
and its transformer-encoded annotation in the source language to the transformer decoder,
the model can learn the semantic association between the source and target languages
more efficiently and achieves significant performance improvements on classic WMT
English-French and English-German tasks. In addition, the image-trained model can be
used with or without the presence of associated source-language images at inference time.
Chapter 1
Introduction
1.1 Significance of the Research
Information presents itself in many different forms: it can be a function of space, where
each pixel or voxel contributes to the richness of the two-dimensional (2D)/3D world; it
can also be a function of time/sequence, through which the flow of information is
always directional - the further away the information is from us, the less observable it
is. For a large number of sequence learning problems, including natural language processing
(NLP) and video processing, where the basic input elements appear in a
temporal/sequential order, we are concerned with finding semantic patterns in the input
sequences. These patterns can help us solve the underlying problems by providing
sequential/temporal information. A learning system's ability to grasp such
patterns is thus crucial to its final performance. Applications of such learning systems are
omnipresent in our daily lives, including but not limited to spam email detection, text
content analysis, language modeling, machine translation, speech transcription, object
tracking for surveillance systems, and interactive systems such as automated customer
services. Despite their ubiquitous presence, a systematic study of their pattern characterization
capabilities is lacking. All this makes the investigation and exploration of
two pattern characterization capabilities meaningful, namely memory, the ability to
associate past and present information, and visual imagination, the ability to associate
sequential information with a visual representation. This research tries, on the one hand, to do some
fundamental work on formally defining what memory is and how such a definition can be
used to improve various existing learning systems, and, on the other hand, to explore the
possibility of devising a sequence-learning system that uses images as additional
information. Both aim at creating new and better systems that are built
upon solid theories.
For the study of memory, to improve existing systems we focus on RNNs, as they
have long been considered the most advanced technique for dealing with temporal/sequential
information. Since Elman's "Finding Structure in Time" was published in 1990, it has long
been believed that an RNN is able to handle very long sequences by "encoding" the
input into a vector called the hidden state. More recently, RNNs have shown breakthroughs
in speech recognition [GMH13], RNN-powered machine translation systems now give
performance similar to human translators [Stab], and RNNs can also describe the content
of images [VTBE15] and videos [BGC17]. Yet how exactly the "magic power" of
an RNN's memory forms remains unanswered despite some experimental observations.
Like other NN-based solutions, RNNs are often treated as black boxes: they take
input sequences and generate the desired outputs if properly trained. The performance
of an RNN thus depends heavily on the experience of the engineers who train
the RNN models. The tuning of RNN training hyper-parameters takes a lot
of computing resources and time, and the training results also vary widely due to
random model parameter initialization. In addition, the "conventional wisdom" of the
long-memory advantage of RNNs has recently been challenged [BKK18], which gives rise
to proposals of non-RNN-based sequence learning systems, including the convolutional
sequence-to-sequence model [GAMY+17] and the Transformer [VSP+17]. Even though
there are some works on the theory of RNNs [HS97, RMB13, BSF94, MLY18, VNO18],
there are very few works analyzing their memory; other works are mostly experimental
[Elm90, PGCB14, GSC00, GSK+17]. All this makes theoretical work on RNN
memory not only helpful but necessary.
To further broaden the discussion to all memory learning systems, we are also interested
in how a system characterizes sequential information in a compact form.
This leads to the investigation of DR techniques for sequential data. In NLP, DR is
often required to alleviate the so-called "curse of dimensionality" problem [BDVJ03].
This occurs when the numericalized input data lie in a sparse high-dimensional space.
Such a problem partly arises from the large size of the vocabulary and partly comes from
sentence variations with similar meanings. Both contribute to a high degree of data pattern
diversity, and a high-dimensional space is required to represent the data adequately in numerical
form. Due to the ever-increasing amount of data on the Internet, language
data become even more diverse. As a result, previously well-solved problems such as
text classification (TC) face new challenges [ZJL15, MP18]. An effective DR technique
continues to play a critical role in tackling these challenges.
For the study of multi-modal systems for sequence learning tasks, the potential benefits
of utilizing the rich multimedia content on the Internet and cross-field knowledge
(especially from computer vision) are significant if such a system can be successfully
deployed. The input to the system usually includes images and their associated
textual information (e.g., image annotations), and the tasks vary from image captioning
[MXY+17, VTBE15, XBK+15] to visual language navigation (VLN) [FHC+18,
MLW+19, WHC+19, WXWW18, AWT+18]. For multi-modal MT, images with annotations
in the source and target languages are used for training and/or inference. However,
it has been found recently that most multi-modal translation systems are not significantly
better than an off-the-shelf text-only MT model [SFSE16]. There remains an open question
about how translation models should take advantage of visual context, because, from
the perspective of information theory, the mutual information of two random variables,
$I(X;Y)$, will always be no greater than $I(X;Z,Y)$ due to the following fact:

$$I(X;Z,Y) - I(X;Y) = \mathrm{KL}\big(p(X,Y,Z)\,\|\,p(X|Y)\,p(Z|Y)\,p(Y)\big), \qquad (1.1)\text{--}(1.2)$$
where the Kullback-Leibler (KL) divergence is non-negative. This makes us believe
that the visual content will help the translation systems. Since the standard paradigm of
multi-modal translation always considers the problem as a supervised learning task, the
parallel corpus is usually sufficient to train a good translation model, and, as aforemen-
tioned, the gain from the extra image input is very limited. Moreover, the scarcity of
well-formed datasets that include both images and the corresponding multilingual text
descriptions is a constraint that prevents the development of larger-scale models.
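As an aside (a short derivation added here by the editor for completeness; it is a standard identity and not taken from the original text), the equality in Eqs. (1.1)-(1.2) follows directly from the definition of mutual information:

\begin{align*}
I(X;Z,Y) - I(X;Y)
  &= \mathbb{E}\!\left[\ln\frac{p(X,Y,Z)}{p(X)\,p(Y,Z)}\right]
   - \mathbb{E}\!\left[\ln\frac{p(X,Y)}{p(X)\,p(Y)}\right] \\
  &= \mathbb{E}\!\left[\ln\frac{p(X,Y,Z)\,p(Y)}{p(Y,Z)\,p(X,Y)}\right]
   = \mathbb{E}\!\left[\ln\frac{p(X,Y,Z)}{p(X|Y)\,p(Z|Y)\,p(Y)}\right] \\
  &= \mathrm{KL}\big(p(X,Y,Z)\,\|\,p(X|Y)\,p(Z|Y)\,p(Y)\big) \;\ge\; 0 .
\end{align*}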
In order to address these issues, we propose to formulate the multi-modal translation
problem as an unsupervised learning task where source and target language corpora are
prepared separately rather than one-to-one paired, which is closer to real applications.
This is particularly important given the massive amount of paired image and monolingual
text data being produced every day (e.g., a news title and its illustrating picture).
1.2 Related Work
1.2.1 RNN’s Memory
Before Michael Jordan introduced the first groundbreaking work on RNNs [Jor97]
to the sequence learning world, theorists in this field often relied on tools developed
for dynamical systems and modeled the systems as state machines. In those models, such
as the hidden Markov model (HMM), the Kalman filter, or naive Bayes (NB), the concept of
memory stays irrelevant, since they either do not explicitly model the order of events
in a sequence or presume that events are only associated with each other through their
immediate neighbors. The emergence of the RNN, for the first time, makes the learning of
complex orderings of sequences possible. It was then shown by Elman [Elm90] that an
RNN builds an internal representation of semantic patterns. The memory of a cell characterizes
its ability to map input sequences of a certain length into such a representation.
Here, we define memory as a function that maps elements of the input sequence to
the current output. Thus, the memory of an RNN is not only about whether an element
can be mapped into the current output but also about how this mapping takes place. It was
reported by Gers et al. [GSC00] that a simple recurrent network (SRN) only memorizes
sequences of length between 3 and 5 units while a long short-term memory (LSTM)
could memorize sequences of length longer than 1000 units. In [VNO18], the expressive
power (which is a concept closely related to memory) of RNN is investigated with the
conclusion that RNN is exponentially more expressive than, say, a shallow CNN model.
1.2.2 Dimension Reduction for Sequence
In contrast to the scarcity of the work on RNN’s memory, there are many works on
DR for NLP. The mainstream ones are the neural network (NN) based techniques
[ZJL15,MVN13,GSZ13,ACPSRI17,CXHW17,JGBM17]. In [BDVJ03], each element
in an input sequence is first numericalized/vectorized as a vocabulary-sized one-hot vec-
tor with bit “1” occupying the position corresponding to the index of that word in the
vocabulary. This vector is then fed into a trainable dense network called the embedding
layer as shown in Fig. 1.1.
Figure 1.1: NN based word embedding.

The output of the embedding layer is another vector of reduced size. In [MSC+13],
the embedding layer is integrated into a recurrent NN (RNN) used for language modeling
so that the trained embedding layer can be applied to more generic language tasks.
Both [BDVJ03] and [MSC+13] conduct dimension reduction at the word level. Hence,
they are called word embedding methods. Among non-neural-network dimension reduction
methods [DDF+90, KBDB13, WLC+15, YZL09, Uys16, CZLZ16, SLK19], principal
component analysis (PCA) is a popular one. In [DDF+90], sentences are first
represented by vocabulary-sized vectors, where each entry holds the frequency of a par-
ticular word in the vocabulary. Each sentence vector forms a column in the input data
matrix. Then, the PCA is used to generate a transform matrix for dimension reduction
on each sentence.
1.2.3 Unsupervised Multi-modal Machine Translation (UMNMT)
Similar to the research on RNN memory, multi-modal systems for supervised/unsupervised
MT are a brand new area that has recently gained traction in academia.
To the best of our knowledge, our research on unsupervised multi-modal systems for
MT is one of the few pioneering works [SFB+19, YLL18] in this field. Most of the existing
multi-modal neural MT (NMT) systems are supervised [HSFA18, CLC17, EK17,
SFSE16, LHM18, TMS+16], whose detailed designs will be elaborated in Sec. 2.3.1.
We place our work in context by arranging several prior popular topics, along the
axes of unsupervised NMT (UNMT), image caption and multi-modal MT.
Unsupervised Machine Translation Existing methods in this area [ALAC18,
LCDR18, LOC+18] are mainly modifications of the encoder-decoder schema. Their key
ideas are to build a common latent space between the two languages (or domains) and
to learn to translate by reconstructing in both domains. The difficulty in multi-modal
translation is the involvement of another, visual domain, which is quite different from
the language domain. The interaction between image and text is usually not as symmetric
as that between two text domains. This is the reason why we handle the attention module
cautiously.
Image Caption Most standard image caption models are built on a CNN-RNN based
encoder-decoder framework [KFF15, VTBE15], where the visual features are extracted
by a CNN and then fed into an RNN to output word sequences as captions. Since our
corpora contain image-text paired data, our method also draws inspiration from image
caption modeling. Thus, we also embed the image-caption model within our computational
graph, although the transformer architecture is adopted as a substitute for the RNN.
Multi-modal Machine Translation This problem was first proposed by [SFSE16] in
the WMT16 shared task at the intersection of natural language processing and computer
vision. It can be considered as building a multi-source encoder on top of either
an MT or an image caption model, depending on the definition of the extra source. Most
multi-modal MT research still focuses on the supervised setting, while [YLL18, NN17] are, to
the best of our knowledge, the two pioneering works that consider generalizing multi-modal
MT to an unsupervised setting. However, their setups put restrictions on the input
data format. For example, [YLL18] requires the training data to be image-text pairs while
the inference data is text-only input, and [NN17] requires the image-text pair format for
both training and testing. These requirements limit the model scale and generalization ability, since
large amounts of monolingual corpora are more available and less expensive. Thus, in
our model, we specifically address this issue with controllable attention and an alternative
training scheme.
1.3 Contributions of the Research
Three works are proposed for understanding and improving sequence learning systems.
Two of them are based on different definitions of memory; the third is a multi-modal
design for unsupervised NMT.
Memory as a function that maps elements in a sequence to the current output.
Based on this definition, we investigated the memory function responses of the SRN,
LSTM and GRU and found the following properties of their memory.
1. Similarity The memory function of the LSTM is similar to that of the GRU.
2. Longer Memory The LSTM and GRU are capable of retaining a longer memory
response than the SRN due to their forget gate and update gate, respectively.
3. Memory Decay Due to the presence of the forget gate and update gate in the LSTM
and GRU, they suffer from memory decay regardless of the choice of their model
parameters.
To the best of our knowledge, findings 1-3 above are unprecedented. They confirm
the conventional wisdom about the memory capability of the LSTM and GRU in
comparison to the SRN, and they also challenge the claim that their memory can go to
infinity.
Our work on RNN memory provides a framework that people can expand upon;
possible directions include the rate of memory decay of the LSTM and GRU and
how to avoid memory decay. On the second front, we offer a solution called
the ELSTM. The ELSTM uses a scaling factor to better attend to the semantic input if
there is memory decay. It has achieved up to 30% gain compared to LSTM and
GRU for complex language tasks.
Memory as a compact representation of sparse sequential data.
This definition allows us to explore the available techniques for DR. We propose
a very efficient DR system called TMPCA, which has the following advantages over
existing solutions:
1. High efficiency. It reduces the input data dimension with a small model size
at low computational complexity.
2. Low information loss. It maintains high mutual information between an input
and its dimension-reduced output.
3. Sequential preservation. It preserves the positional relationship between input
elements.
4. Unsupervised learning. It does not demand labeled training data.
TMPCA is inspired by PCA but has much lower computational complexity and
a similar information preservation property. A dense (fully connected) network
trained on the TMPCA-preprocessed data achieves better performance than the state-of-the-art
fastText and other neural-network-based solutions, which shows that the
non-RNN approach also merits attention for sequence learning systems.
UMNMT
1. UMNMT with an end-to-end transformer We formulate the multi-modal MT
problem in an unsupervised setting that better fits the real scenario and propose
an end-to-end transformer-based multi-modal model.
2. UMNMT with controllable attention We present two technical contributions:
we successfully train the proposed model with auto-encoding and cycle-consistency
losses, and we design a controllable attention module to deal with
both uni-modal and multi-modal data.
3. Significant performance boost We apply our approach to the multilingual
Multi30K dataset on the English-French and English-German translation
tasks, and the translation output and the attention visualization show that the
gain from the extra image is significant in the unsupervised setting.
1.4 Organization of the Dissertation
The rest of this dissertation is organized as follows. The basic models of RNN and their
implementations, the basic DR techniques for sequential data, and the state-of-the-art
NMT model along with multi-modal systems are introduced in Chapter 2. The memory
investigation of RNN and the ELSTM model are elaborated in Chapter 3. Chapter 4
introduces TMPCA for text classification problems. The UMNMT model, its training
method and performance are presented in Chapter 5. Finally, concluding remarks and
future work on understanding sequence learning systems with memory are given
in Chapter 6.
Chapter 2
Background Review
2.1 Recurrent Neural Networks
RNN is a particular type of NN that has at least one cyclic connection. More specifically,
an RNN is a network in which it "is possible to follow a path from a unit back to itself" [Jor97].
The motivation for such a design is that an RNN can model sequential patterns more effectively
by using the cyclic connections to describe the sequential/temporal dependencies
in a more flexible way. For example, the RNN shown in Fig. 2.1 can produce the sequence
AAAB, where A is the vector $[1, 1]^T$ and B is the vector $[0, 0]^T$.
Figure 2.1: RNN that produces AAAB.
Each circle in the figure denotes a neuron, which takes a weighted input and computes
an output based on its activation function. The weights for the signals from one neuron
to another are shown beside the links between them. The biases are shown in the circles.
The RNN has two input neurons, one hidden neuron and two output neurons. The input
neurons have a linear activation function, while the other neurons have a binary thresholding
function that outputs 1 if the weighted input is positive, and 0 otherwise. The
output of each neuron is shown in Table 2.1.
Table 2.1: Activation of Neurons.
Time | Input Activation  | Hidden Activation | Output Activation
0    | [0, 0]^T          | 0                 | [1, 1]^T  (A)
1    | [1, 1]^T          | 0                 | [1, 1]^T  (A)
2    | [1.5, 1.5]^T      | 0                 | [1, 1]^T  (A)
3    | [1.75, 1.75]^T    | 1                 | [0, 0]^T  (B)
Like other artificial NN, RNN can be trained using stochastic gradient descent
(SGD).
In practice, there are three popular RNN basic computing models: SRN, LSTM and
GRU. They are also called cell models. Built upon these cells, various RNN models
have been proposed. To name a few, there are the bidirectional RNN (BRNN) [SP97],
the encoder-decoder model [CMG+14, SVL14, VKK+15, BCB15] and the deep RNN
[PGCB14]. We will introduce the cell models and the macro models in the following
sections.
2.1.1 RNN Cell Models
SRN is one of the earliest proposals of RNN. It has several variations, namely Elman's
model and Jordan's model, as shown in Eq. 2.1 and Eq. 2.3, respectively.
$$c_t = f_c(W_c\, c_{t-1} + W_{in} X_t + b_c), \qquad (2.1)$$
$$h_t = f_h(W_h\, c_t + b_h). \qquad (2.2)$$

$$c_t = f_c(W_c\, h_{t-1} + W_{in} X_t + b_c), \qquad (2.3)$$
$$h_t = f_h(W_h\, c_t + b_h). \qquad (2.4)$$
where the subscript $t$ is the time step index, and $X_t \in \mathbb{R}^M$ is the input vector at time step $t$.
In NLP, sequence elements like characters, words or other symbols are numericalized by
assigning them a unique vector which indicates their entry position in the dictionary.
Such a vector is called a one-hot vector if its size is equal to the total number of elements
in the dictionary, and the position of bit 1 is their entry position with the other positions
occupied by bit 0. $W$ and $b$ are the network weights and biases. $h_t \in \mathbb{R}^N$, where $N$
is the number of cells, is the cell output with activation function $f_h$, and $c_t \in \mathbb{R}^N$ is
called the hidden state with activation function $f_c$. For the rest of the thesis, we omit the
bias terms by including them in the corresponding weight matrices. The multiplication
between two equal-sized vectors is element-wise multiplication.
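To make Eqs. (2.1)-(2.2) concrete, a minimal NumPy sketch of a single SRN step is given below. This is an editor-added illustration, not code from the thesis; the tanh/softmax choices for $f_c$ and $f_h$ and the toy dimensions are assumptions.

```python
import numpy as np

def srn_step(x_t, c_prev, W_c, W_in, b_c, W_h, b_h):
    """One SRN step following Eqs. (2.1)-(2.2).

    x_t    : (M,)  one-hot (or embedded) input at time step t
    c_prev : (N,)  hidden state from the previous time step
    Returns (c_t, h_t); tanh and softmax stand in for f_c and f_h.
    """
    c_t = np.tanh(W_c @ c_prev + W_in @ x_t + b_c)   # Eq. (2.1)
    logits = W_h @ c_t + b_h
    h_t = np.exp(logits - logits.max())
    h_t = h_t / h_t.sum()                            # Eq. (2.2), softmax as f_h
    return c_t, h_t

# Toy usage: a vocabulary of M symbols, N hidden cells.
M, N = 5, 8
rng = np.random.default_rng(0)
W_c, W_in = rng.normal(size=(N, N)), rng.normal(size=(N, M))
W_h, b_c, b_h = rng.normal(size=(M, N)), np.zeros(N), np.zeros(M)
c = np.zeros(N)
for idx in [0, 3, 1]:                # a short symbol sequence
    x = np.eye(M)[idx]               # one-hot numericalization
    c, h = srn_step(x, c, W_c, W_in, b_c, W_h, b_h)
print(h)                             # distribution over the next symbol
```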
The recurrent connections of the two models can be found in Eq. 2.1 and Eq. 2.3, respectively:
in Elman's model (Eq. 2.1), the hidden state is fed back into the cell, while in Jordan's model
(Eq. 2.3), the cell output is looped back instead. It can be seen that the hidden state is a function not only of
the current input but of all previous inputs as well. Thus, such a computing architecture
is ideal for building a probabilistic model to calculate the posterior $p(h_t \mid \{X_i\}_{i=1}^{t})$.
It was found in [Elm90] that the hidden state of the SRN is able to group words with
similar properties into closely spaced clusters in a projected compact space, as shown in
Fig. 2.2.
Figure 2.2: Clustering capability of SRN’s hidden state.
One usage of the SRN is to predict the next element given the previous elements in a
sequence, which is also called language modeling (LM), as shown in Fig. 2.3.
LSTM was proposed in [HS97] in 1997 and it is still one of the state-of-the-art RNN
cell models today. It was motivated by the need to solve the SRN's gradient vanishing/exploding
problem. Such a problem arises from the particular way RNN models are trained, which
is called back-propagation through time (bptt), as illustrated in Fig. 2.4.
Figure 2.3: SRN for character prediction.
Figure 2.4: Back-propagation through time.
Fig. 2.4 shows an unrolled depiction of an RNN model across time steps $t-1$, $t$
and $t+1$; the error gradient ($\delta$) at $t+1$ is back-propagated to all the model parameters
(the weights $W$ and biases $b$) at $t$, $t-1$ and so on. Unlike in other feed-forward NN
models, the $\delta$ is back-propagated not only to the model at the current time step; it is also
back-propagated to previous time steps, all the way to the beginning of the sequence if the
bptt length is not truncated. Bptt stops at a pre-defined distance from the current time
step $t$ if it is truncated. It can be seen that the RNN is inherently a deep model in time.
One difference from a deep CNN model is that the weights and biases across time steps
share parameters. The motivation of bptt is that the RNN can be trained more efficiently
by explicitly considering the temporal/sequential nature of the input signal. Like other
deep models, the SRN would face increasing difficulty in the training process when the
input sequence becomes longer: due to the increasing number of layers over time, the
error gradient would either vanish or explode [HS97, RMB13] as shown in the paired
comparison between ordinary bp and bptt in Eq. 2.5 and Eq. 2.6, respectively, where $\eta$
indicates the learning rate and $t_1$ denotes the earliest time step reached by (truncated) bptt:

$$W_c \;\leftarrow\; -\eta\,\frac{\partial \delta_t}{\partial W_c} \;+\; W_c, \qquad (2.5)$$

$$W_c \;\leftarrow\; -\eta\,\frac{\partial \delta_t}{\partial c_{t}}\left[\prod_{i=t_1}^{t-1}\frac{\partial c_{i+1}}{\partial c_i}\right]\frac{\partial c_{t_1}}{\partial W_c} \;+\; W_c. \qquad (2.6)$$
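A quick way to see the vanishing/exploding behavior from Eq. (2.6) (a short bound added by the editor for clarity, not from the original text): if every recurrent Jacobian satisfies $\|\partial c_{i+1}/\partial c_i\| \le \gamma$, then

$$\left\lVert \prod_{i=t_1}^{t-1} \frac{\partial c_{i+1}}{\partial c_i} \right\rVert \;\le\; \prod_{i=t_1}^{t-1} \left\lVert \frac{\partial c_{i+1}}{\partial c_i} \right\rVert \;\le\; \gamma^{\,t-t_1},$$

which decays toward zero when $\gamma < 1$ and blows up when $\gamma > 1$ as the unrolled depth $t - t_1$ grows.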
LSTM solves this problem by introducing an internal recurrent structure called the
constant error carousal (CEC), as shown in Fig. 2.5.
Figure 2.5: The diagram of original LSTM cell.
Fig. 2.5 shows the original proposal of LSTM, where $\phi$, $\sigma$ and $\odot$ denote the hyperbolic
tangent function, the sigmoid function and the multiplication operation, respectively. All
of them operate in an element-wise fashion. The LSTM cell has an input gate, an output
gate, and a constant error carousal (CEC) module. The central idea of CEC is to force
the error gradient of the recurrent connection, $\partial c_{i+1}/\partial c_i,\ \forall i \in \{1,\ldots,t\}$, to be 1 so that
the multiplicative term in Eq. 2.6 does not converge to zero or diverge to infinity. The
presence of the input gate and the output gate ensures that the error gradient does not flow
from the current cell to other cells.
It was later found that the unit-gradient enforcement of the CEC would make the hidden
state of the LSTM either increase or decrease over a prolonged period of time if the input
sequence patterns stay similar. This is called the hidden state overflow problem: the
absolute value of $c_t$ becomes very large so that its hyperbolic-tangent-activated output
stays at either $+1$ or $-1$, as shown in Fig. 2.6.
Figure 2.6: Hidden state of LSTM overflow problem.
Fig. 2.6 shows the evolution of $c_t$ as the input sequence becomes longer. The input
is a series of repeating symbols "T" and "P". The interval length between symbols on
the horizontal axis denotes the length of the repeating symbols starting from its left, and the
vertical line denotes the start of a new series. As can be seen, the hidden state of the
LSTM keeps increasing or decreasing regardless of the start of a new series.
The problem is solved by introducing a forget gate, which allows the recurrent error gradient to
be equal to or less than one; this is equivalent to allowing gradient vanishing as long as
the memory length is long enough for the specific task. In [GSC00], it was observed
that the forget gate acts like a resetting mechanism: when a terminal signal
is received, its activation is set close to zero so that the hidden state $c_t$ is
"refreshed", as if the previous elements in the sequence were forgotten, as shown in Fig. 2.7.
We will later argue in Chapter 3 that such a "resetting" phenomenon actually stems from
the memory decay property of the forget gate, and that this constrains the LSTM's memory
capability.
Figure 2.7: Resetting of hidden state by introducing forget gate in LSTM.
Mathematically, the LSTM cell can be written as
$$c_t = \sigma(W_f I_t)\, c_{t-1} + \sigma(W_i I_t)\,\phi(W_{in} I_t), \qquad (2.7)$$
$$h_t = \sigma(W_o I_t)\,\phi(c_t), \qquad (2.8)$$

where $c_t \in \mathbb{R}^N$, and the column vector $I_t \in \mathbb{R}^{M+N}$ is a concatenation of the current input,
$X_t \in \mathbb{R}^M$, and the previous output, $h_{t-1} \in \mathbb{R}^N$ (i.e., $I_t^T = [X_t^T, h_{t-1}^T]$). Furthermore,
$W_f$, $W_i$, $W_o$ and $W_{in}$ are the weight matrices for the forget gate, the input gate, the output
gate and the input, respectively. The detailed block diagram is shown in Fig. 2.8.
Figure 2.8: The diagram of a LSTM cell with forget gate.
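A minimal NumPy sketch of one LSTM step with a forget gate, following Eqs. (2.7)-(2.8), is shown below. This is an editor-added illustration, not code from the thesis; bias terms are folded into the weights as stated above, and the toy dimensions are arbitrary.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_in):
    """One LSTM step following Eqs. (2.7)-(2.8).

    x_t: (M,) input; h_prev, c_prev: (N,) previous output / hidden state.
    Each weight matrix has shape (N, M + N) and acts on I_t = [x_t; h_prev].
    """
    I_t = np.concatenate([x_t, h_prev])                 # I_t^T = [X_t^T, h_{t-1}^T]
    c_t = sigmoid(W_f @ I_t) * c_prev \
        + sigmoid(W_i @ I_t) * np.tanh(W_in @ I_t)      # Eq. (2.7)
    h_t = sigmoid(W_o @ I_t) * np.tanh(c_t)             # Eq. (2.8)
    return h_t, c_t

# Toy usage with random weights.
M, N = 4, 3
rng = np.random.default_rng(1)
W_f, W_i, W_o, W_in = [rng.normal(size=(N, M + N)) for _ in range(4)]
h, c = lstm_step(rng.normal(size=M), np.zeros(N), np.zeros(N), W_f, W_i, W_o, W_in)
```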
GRU was originally proposed for neural machine translation [CMG+14]. It provides
an effective alternative for the LSTM. Its operations can be expressed by the following
four equations:
$$z_t = \sigma(W_z X_t + U_z h_{t-1}), \qquad (2.9)$$
$$r_t = \sigma(W_r X_t + U_r h_{t-1}), \qquad (2.10)$$
$$\tilde{h}_t = \phi(W X_t + U(r_t \odot h_{t-1})), \qquad (2.11)$$
$$h_t = z_t\, h_{t-1} + (1 - z_t)\,\tilde{h}_t, \qquad (2.12)$$
where $X_t$, $h_t$, $z_t$ and $r_t$ denote the input, the hidden-state, the update-gate and the reset-gate
vectors, respectively, and $W_z$, $W_r$ and $W$ (together with $U_z$, $U_r$ and $U$ in Eqs. (2.9)-(2.11))
are trainable weight matrices. Its hidden state is also its output, which is given in Eq. (2.12).
Figure 2.9: The diagram of a GRU cell.
As shown in Fig. 2.9, for the same input $X_t$ and the same number
of cells $N$, the GRU cell uses fewer parameters than the LSTM shown
in Fig. 2.8, since the LSTM has four weight matrices for the input, input gate, output gate and
forget gate whereas the GRU has only three weight matrices for the update gate, the reset gate and
the weight connection for $\tilde{h}$. In the original GRU paper [CMG+14], it was argued that
the reset gate functions like the forget gate in the LSTM and the update gate functions
like a valve that controls the flow of information from the previous inputs. We will, however,
show in Chapter 3 that the GRU is actually the same as the LSTM in terms of computing
architecture, with no fundamental difference.
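For comparison, a minimal NumPy sketch of one GRU step following Eqs. (2.9)-(2.12) is given below (editor-added; the sigmoid/tanh activations are the conventional choices assumed here):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU step following Eqs. (2.9)-(2.12)."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)           # update gate, Eq. (2.9)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)           # reset gate,  Eq. (2.10)
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))   # candidate,   Eq. (2.11)
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde        # Eq. (2.12): output = hidden state
    return h_t
```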
There is no foregone conclusion as to whether the GRU outperforms the LSTM or not. It has
been observed that the GRU converges faster than the LSTM for some particular RNN models,
such as the sequence-to-sequence (seq2seq) model, which we will elaborate on in the following
sections. However, in [GSK+17], it was concluded that there is neither an obvious advantage
nor a disadvantage of the LSTM as compared to the GRU. In [CGCB14], it was found that the GRU
outperforms the LSTM in some cases, but no conclusive remarks were given.
2.1.2 RNN Macro Models
Single RNN cell models are rarely used in practice due to their limited expressiveness
for modeling real problems. Instead, more powerful RNN models built upon these
cells are used with different probabilistic models. One problem of particular interest is
called the sequence-in-sequence-out (SISO), or sequence-to-sequence, problem. In this problem,
the RNN model predicts an output sequence $\{Y_t\}_{t=1}^{T_1}$ with $Y_i \in \mathbb{R}^N$, based on an
input sequence $\{X_t\}_{t=1}^{T}$ with $X_i \in \mathbb{R}^M$, where $T$ and $T_1$ are the lengths of the input and the
output sequences, respectively.
One of the popular macro RNN models for solving the SISO problem is BRNN
[SP97]. As its name indicates, BRNN takes inputs in both forward and backward direc-
tions as shown in Fig. 2.10, and it has two RNN cells to take in the input: one takes the
input in the forward direction, the other takes the input in the backward direction.
Figure 2.10: The diagram of BRNN.
The motivation for BRNN is to fully utilize the input sequence if future information
($\{X_i\}_{i=t+1}^{T}$) is accessible. This is especially helpful if the current output $Y_t$ is also a function
of future inputs. The conditional probability density function of the BRNN is of the form

$$P(Y_t \mid \{X_i\}_{i=1}^{T}) = W_f\, p_t^{f} + W_b\, p_t^{b}, \qquad (2.13)$$
$$\hat{Y}_t = \arg\max_{Y_t} P(Y_t \mid \{X_i\}_{i=1}^{T}), \qquad (2.14)$$

where

$$p_t^{f} = P(Y_t \mid \{X_i\}_{i=1}^{t}), \qquad (2.15)$$
$$p_t^{b} = P(Y_t \mid \{X_i\}_{i=t}^{T}), \qquad (2.16)$$

$W_f$ and $W_b$ are trainable weights, and $\hat{Y}_t$ is the predicted output element at time step
$t$. The output is thus a combination of the density estimate of a forward RNN and the
output of a backward RNN. Due to the bidirectional design, the BRNN can utilize the
information of the entire input sequence to predict each individual output element. One
example where such treatment is helpful is generating a sentence like “this is an apple”
for LM. In this case, the word “an” strongly associates with the following word “apple”;
a forward-directional RNN model would have difficulty generating “an” before “apple”.
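A sketch of how the forward and backward density estimates are combined, as in Eqs. (2.13)-(2.14), is shown below (editor-added; `forward_rnn` and `backward_rnn` are hypothetical placeholder functions that would each return a per-step distribution over the output vocabulary):

```python
import numpy as np

def brnn_predict(x_seq, forward_rnn, backward_rnn, W_f, W_b):
    """Combine forward/backward estimates as in Eqs. (2.13)-(2.14).

    forward_rnn(x_seq)  -> (T, N) array of p_t^f, run left to right
    backward_rnn(x_seq) -> (T, N) array of p_t^b, run right to left
    W_f, W_b            -> (N, N) trainable combination weights
    """
    p_f = forward_rnn(x_seq)               # uses {X_i}, i = 1..t
    p_b = backward_rnn(x_seq)              # uses {X_i}, i = t..T
    scores = p_f @ W_f.T + p_b @ W_b.T     # Eq. (2.13), one row per time step
    return scores.argmax(axis=1)           # Eq. (2.14): predicted Y_t indices
```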
Encoder-decoder was first proposed for machine translation (MT), along with the GRU,
in [CMG+14]. It was motivated by the need to handle the situation when $T_1 \neq T$. It consists of
two RNN cells: one is called the encoder, the other the decoder. The detailed design
of one of the early proposals [SVL14] of the encoder-decoder RNN model is illustrated in
Fig. 2.11.
Figure 2.11: The diagram of sequence to sequence (seq2seq).

As can be seen in Fig. 2.11, the encoder (denoted by Enc) takes the input sequence
of length $T$ and generates its output $h_i^{Enc}$ and hidden state $c_i^{Enc}$, where $i \in \{1,\ldots,T\}$. In
the seq2seq model, the encoder's hidden state at time step $T$ is used as the representation of
the input sequence. The decoder then utilizes the hidden state information to generate
the output sequence of length $T_1$ by initializing its hidden state $c_1^{Dec}$ as $c_T^{Enc}$. So the
decoding process starts after the encoder has processed the entire input sequence. In
practice, the input to the decoder at time step 1 is a pre-defined start-decoding symbol.
At the following time steps, the previous output $Y_{t-1}$ is used as the input. The decoder
stops the decoding process if a special pre-defined stopping symbol is generated.
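The decoding procedure described above can be sketched as a simple greedy loop (editor-added Python sketch; `encoder`, `decoder_step` and the `START`/`STOP` symbol indices are hypothetical placeholders, not part of the thesis):

```python
import numpy as np

def greedy_decode(x_seq, encoder, decoder_step, START, STOP, max_len=50):
    """Greedy seq2seq decoding: initialize the decoder with the encoder's
    final hidden state and feed each predicted token back in."""
    _, c_enc_T = encoder(x_seq)          # hidden state after the whole input
    c_dec = c_enc_T                      # c_1^Dec initialized as c_T^Enc
    y_prev, outputs = START, []
    for _ in range(max_len):
        probs, c_dec = decoder_step(y_prev, c_dec)
        y_prev = int(np.argmax(probs))   # previous output becomes next input
        if y_prev == STOP:               # stop at the pre-defined stopping symbol
            break
        outputs.append(y_prev)
    return outputs
```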
Compared with the BRNN, the encoder-decoder is not only advantageous in its ability
to handle input/output sequences of different lengths, it is also more capable of generating
a well-aligned output sequence by explicitly feeding the previously predicted outputs
back to its decoder so that the prediction of $Y_t$ has more context. This makes the
model estimate the following density function:
$$p_t = P(Y_t \mid \{\hat{Y}_i\}_{i=1}^{t-1},\, \{X_i\}_{i=1}^{T}), \qquad (2.17)$$
$$\hat{Y}_t = \arg\max_{Y_t} p_t, \quad \forall t \in \{1,\ldots,T_1\}. \qquad (2.18)$$
One example where such treatment is helpful is translating the Chinese sentence
“你来自哪里” to the English “where are you from”, where “你” corresponds to “you”,
“来自” corresponds to “from”, “哪里” corresponds to “where”, and the word “are” has
no corresponding Chinese alignment. So the word “are” is more pertinent to the words
“where” and “you” in the translated English sentence than to the source sentence in Chinese.
Since the decoder of seq2seq has no bidirectional design, which will be addressed
later in Chapter 3, “are” cannot be conditioned on “you”; nevertheless, “where” should give
a strong hint as to what should be generated next.
To further encourage the alignment, various attention mechanisms have been proposed
for the encoder-decoder model. In [VKK+15, BCB15], additional weighted connections are
introduced to connect the decoder to the hidden states of the encoder.
It is worthwhile to point out that even though the encoder-decoder model, and the RNN
in general, can take variable-length input sequences during inference, a fixed input/output
length is still used in practice to train the RNN model, because bptt is applied on an
unrolled RNN model. Any sequence that is longer than the fixed length is truncated,
and any shorter one is padded with a pre-defined special symbol or numericalized
vector.
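In code, the fixed-length preparation amounts to a small helper like the following (editor-added sketch; `PAD` is an assumed padding-symbol index):

```python
def fix_length(seq, fixed_len, PAD=0):
    """Truncate sequences longer than fixed_len and pad shorter ones."""
    if len(seq) >= fixed_len:
        return seq[:fixed_len]                     # truncate
    return seq + [PAD] * (fixed_len - len(seq))    # pad with the special symbol

# e.g. fix_length([5, 2, 9], 5) -> [5, 2, 9, 0, 0]
```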
Deep RNN is an RNN model with multiple stacked RNN cells. Fig. 2.12 shows one
example.
As discussed in Sec. 2.1.1, an RNN is itself a deep model over time; the deep RNN
extends the model depth further, making it a deep feed-forward model at each time step as well.
Figure 2.12: Example of Deep RNN.
Deep RNN model design is still an ongoing topic. At present, there is no mature
prototype. In [SVL14], a seq2seq model with 4 layers of LSTM was reported to deliver
the best performance for MT. In [Wea16], residual connections [HZRS16a] are employed in
the deep design, which achieves a deep encoder-decoder model with 9 encoder layers
and 8 decoder layers. In [PGCB14], 5 different models were proposed, as shown in
Fig. 2.13. However, there is no conclusion as to which model gives the best performance.
Figure 2.13: Deep RNN Proposals.
2.2 Dimension Reduction for Sequence
2.2.1 Principal Component Analysis (PCA)
PCA is one of the most well-known unsupervised DR techniques. It is a linear system
that usually takes a one-dimensional vector of size $D$ as input and transforms it into another
one-dimensional vector of size $D_1$, with $D_1 \leq D$:

$$y = U_{PCA}\, x, \quad x \in \mathbb{R}^{D},\; y \in \mathbb{R}^{D_1}. \qquad (2.19)$$
The objective of PCA is to find a linear transform that projects the input into a
lower-dimensional space and maximizes the output variance along each projected dimension.
To achieve this, a training dataset of $M$ samples, $X = [x_1, \ldots, x_M]$ (one sample per column),
is first mean-removed. Its covariance matrix is then calculated as $\frac{1}{M} X X^{T}$. We then
diagonalize the covariance matrix and obtain its eigenvectors $[u_1, \ldots, u_D]$, $u_i \in \mathbb{R}^D$,
$\forall i \in \{1,\ldots,D\}$, and their corresponding eigenvalues $\lambda_1, \ldots, \lambda_D$ (note that the covariance
matrix is a normal matrix and positive semi-definite, so it is diagonalizable and its eigenvalues
are real numbers). Denote the index of the $j$th largest value of a set $\{x_i\}$ by $\operatorname{argmax}_j\{x\}$;
the PCA transform matrix $U_{PCA}$ is then formed as follows:

$$U_{PCA} = \begin{bmatrix} u^{T}_{\operatorname{argmax}_1\{\lambda\}} \\ \vdots \\ u^{T}_{\operatorname{argmax}_{D_1}\{\lambda\}} \end{bmatrix}, \qquad (2.20)$$

whose rows are the eigenvectors corresponding to the largest $D_1$ eigenvalues in descending order.
It can be proved that $U U^{T} = I$, where $I$ is the identity matrix, and that the variance of $y$ along
each dimension is maximized in the process.
Another interesting property of PCA is that its variance maximization also makes
PCA capable of minimizing the reconstruction error, which is defined as

$$\text{error} = \frac{1}{M}\sum_{i=1}^{M} \left\| x_i - U^{T} U x_i \right\|^{2}, \qquad (2.21)$$

where $\hat{x}_i = U^{T} U x_i$ is called the reconstruction of $x_i$. It is obvious that if $U \in \mathbb{R}^{D \times D}$,
then $\hat{x}_i = x_i$. This property is of particular value for lossy compression.
Under certain conditions, PCA is also proved to be able to maximize the
mutual information between $x$ and $y$ [Lin88], which is defined as
$I(x;y) = \mathbb{E}_{x,y}\!\left[\ln\frac{P(x|y)}{P(x)}\right]$, where $\mathbb{E}$ denotes expectation, $P$ denotes the probability
density function and $\ln$ denotes the natural logarithm. This makes it ideal for input data preprocessing for
classification problems. We will elaborate on the same property for our proposed TMPCA
in Chapter 4.
The training method of PCA shown in Eq. 2.20 has difficulty dealing with a large
training dataset. It was shown in [BH89, IR10] that PCA can be trained using SGD by
treating $U_{PCA}$ as a trainable NN. The reason is that the cost function of the PCA
NN is strictly convex, so the training result converges to the global optimum.
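A compact NumPy sketch of the eigendecomposition-based training in Eq. (2.20) and the transform in Eq. (2.19) is given below (editor-added; samples are arranged one per column, matching the convention above):

```python
import numpy as np

def pca_fit(X, D1):
    """X: (D, M) data matrix, one sample per column.
    Returns U_PCA of shape (D1, D) whose rows are the top-D1 eigenvectors."""
    Xc = X - X.mean(axis=1, keepdims=True)      # mean removal
    cov = (Xc @ Xc.T) / Xc.shape[1]             # (1/M) X X^T
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric PSD matrix
    order = np.argsort(eigvals)[::-1][:D1]      # indices of the largest eigenvalues
    return eigvecs[:, order].T                  # Eq. (2.20)

def pca_transform(U_pca, x):
    return U_pca @ x                            # Eq. (2.19): y = U_PCA x

# The reconstruction x_hat = U^T U x underlies the error in Eq. (2.21).
```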
2.2.2 Text Classification
Text classification has been an active research topic for two decades. Its applica-
tions such as spam email detection, age/gender identification and sentiment analysis
are omnipresent in our daily lives. Traditional text classification solutions are mostly
linear and based on the BoW representation. One example is the naive Bayes (NB)
method [FDM97], where the predicted class is the one that maximizes the posterior
probability of the class given an input text. The NB method offers reasonable perfor-
mance on easy text classification tasks, where the dataset size is small. However, when
the dataset size becomes larger, the conditional independence assumption used in likeli-
hood calculation required by the NB method limits its applicability to complicated text
classification tasks.
Other methods such as the support vector machine (SVM) [MVN13,YZL09,Joa98]
fit the decision boundary in a hand-crafted feature space of input texts. Finding represen-
tative features of input texts is actually a dimension reduction problem. Commonly used
features include the frequency that a word occurs in a document, the inverse-document-
frequency (IDF), the information gain [Uys16, CZLZ16, SB88, YP97], etc. Most SVM
models exploit BoW features, and they do not consider the position information of words
in sentences.
The word position in a sequence can be better handled by the CNN solutions since
they process the input data in sequential order. One example is the character level CNN
(char-CNN) as proposed in [ZJL15]. It represents an input character sequence as a
two-dimensional data matrix with the sequence of characters along one dimension and
the one-hot embedded characters along the other one. Any character exceeding the
maximum allowable sequence length is truncated. The char-CNN has 6 convolutional
(conv) layers and 3 fully-connected (dense) layers. In the conv layer, one dimensional
convolution is carried out on each entry of the embedding vector.
RNNs offer another NN-based solution for text classification [MP18, CXHW17].
An RNN generates a compact yet rich representation of the input sequence and stores it
in the form of hidden states of the memory cell, which is the basic computing unit in an RNN.
There are two popular cell designs: the long short-term memory (LSTM) [HS97] and the
gated recurrent unit (GRU) [CMG+14]. Each cell takes each element from a sequence
sequentially as its input, computes an intermediate value, and updates it dynamically.
Such a value is called the constant error carousal (CEC) in the LSTM and simply a
hidden state in the GRU. Multiple cells are connected to form a complete RNN. The
intermediate values from all cells form a vector called the hidden state. It was observed
in [Elm90] that, if a hidden state is properly trained, it can represent the desired text patterns
compactly, and similar semantic word-level features can be grouped into clusters.
This property was further analyzed in [SK18]. Generally speaking, for a well-designed
representational vector (i.e., the hidden state), the computing unit (or the memory cell)
can exploit the word-level dependency to facilitate the final classification task.
Another NN-based model is fastText [JGBM17]. As shown in Fig. 2.14, it is a
multi-layer perceptron composed of a trainable embedding layer, a hidden mean layer
and a softmax dense layer. The hidden vector is generated by averaging the embedded
words, which makes fastText a BoW model. FastText offers a very fast solution
to text classification: it typically takes less than a minute to train on a large data corpus
with millions of samples, and it gives state-of-the-art performance.
Figure 2.14: Illustration of the fastText model.
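A minimal sketch of a fastText-style forward pass is shown below (editor-added; the embedding dimension, weights and softmax classifier here are illustrative placeholders, not the actual fastText implementation):

```python
import numpy as np

def fasttext_forward(word_ids, E, W_out):
    """E: (V, d) embedding table; W_out: (num_classes, d) softmax layer.
    The hidden vector is the average of the embedded words (a BoW model)."""
    hidden = E[word_ids].mean(axis=0)        # embedding lookup + mean layer
    logits = W_out @ hidden                  # dense softmax layer
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()               # class distribution

# e.g. with V=1000 words, d=10, 4 classes:
rng = np.random.default_rng(2)
p = fasttext_forward([3, 17, 256], rng.normal(size=(1000, 10)),
                     rng.normal(size=(4, 10)))
```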
2.3 Unsupervised Multi-modal Neural Machine Translation
2.3.1 Neural Multi-modal Systems
The NN-based multi-modal architecture [MXY+17, VTBE15, XBK+15, TMS+16, HSFA18, EK17,
SFB+19, YLL18, SK15, SLK16, CLC17, BGC17, SFSE16, LHM18, GM18, FHC+18,
MLW+19, WHC+19, WXWW18, AWT+18] can be used to handle tasks
that take language and image data as input. Fig. 2.15 shows one of the early proposals
of multi-modal design for image captioning problem.
Figure 2.15: Multi-modal architecture for image captioning.
The input to the system is the image on the right; it is fed into a CNN, and the output
of the CNN is a feature vector from its last fully connected layer. The very first word
of the caption is generated based on this feature vector by feeding it to the multi-modal
module - a fully connected layer that takes three inputs: the image feature vector, the
embedded current word in the caption and the output of the RNN at the current time step.
The latter two inputs are initialized as vectors of zeros at the beginning. All these inputs
are concatenated to form a vector. The multi-modal layer is then connected to a softmax
layer which produces the final distribution estimate of the next word. The next word
is then fed back into the system. Like the seq2seq model in Sec. 2.1.2, the whole caption
is complete when a pre-defined termination symbol is generated.
The multi-modal design is motivated by problems that involve NLP and CV, or
more generally, problems that deal with high-dimensional data with spatial correlation
and high-dimensional data with sequential correlation. For an image-in,
language-out (IILO) problem like image captioning, the multi-modal model is also referred to as a
machine translation model [DBFF02, VTBE15], with the image as the source sentence and
the associated language as the target sentence, as shown in Fig. 2.16.
Figure 2.16: Machine translation model.
One thing these MT models have in common, regardless of the task, is that they are encoder-decoder
based, with a CNN as the encoder for processing the image and an RNN as the
decoder for generating the sentence, as shown in Fig. 2.17. We refer to a network
with both a CNN and an RNN (not necessarily encoder-decoder based) as a hybrid network
for the rest of this dissertation.
Figure 2.17: Encoder-decoder based MT model.
Like the encoder-decoder RNN, the hybrid encoder-decoder also uses an attention mechanism for better performance. Fig. 2.18 shows an interesting proposal of a hybrid encoder-decoder in which the RNN attends to different regions of the image at each time step during the caption generation stage.
Figure 2.18: Image captioning with attention.
For multi-modal MT, the problem is formulated as efficient training with auxiliary/image data for the SISO problem. This is useful for languages with relatively small speaking populations, in which case the translated training data for supervised learning is limited. To solve this problem, we can
1. increase the size of the training dataset using the auxiliary data, or
2. increase the learning efficiency by introducing the auxiliary data in the training process.
For the second option, one can do transfer learning or, more specifically, multi-task learning using image data. In [EK17], a hybrid encoder-decoder is proposed that learns two tasks for MT, as shown in Fig. 2.19.
Figure 2.19: Multi-task hybrid encoder-decoder for machine translation.
The network has two learning tasks: one is ordinary MT with the target sentence as the target; the other is to estimate the image's CNN feature vector. The former task uses an RNN and the latter a feed-forward network. The two decoders share the same encoder, which is a BRNN with attention. During inference, the decoder tasked with MT is used together with the encoder; the image-prediction decoder is not included. The dataset used for training is the Multi30K dataset [EFSS16a], a dataset for image captioning in which each image has captions in multiple languages. The training of ImagiNet requires both images and their corresponding annotations in the source and target languages, which is infeasible in practice.
2.3.2 Transformer
Transformer [VSP+17] is one of the most popular state-of-the-art NMT models. Unlike RNN-based ([SVL14, VKK+15, BCB15, Wea16]) or CNN-based ([GAMY+17]) MT solutions, Transformer is built only on dense and residual connections. Thus, Transformer relies on neither recurrent connections nor a sliding window to infer temporal information. Instead, its trainable connections are built across different dimensions (temporal and feature) of the sequential input and output, so that each element in the output sequence connects directly to all elements (before and after) in the input sequence. It is argued that such treatment can model the elements' inter-dependencies more explicitly and lead to better input-output attention.
The cross-dimensional connections are realized by a so-called "scaled dot-product attention" computing unit, whose diagram is shown in Fig. 2.20.

Figure 2.20: Scaled dot-product attention diagram.
As shown in Fig. 2.20, there are three inputs to the attention module, namely the
query (Q), the key (K) and the value (V). The output of the scaled dot-product attention
is then calculated as:
$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (2.22)$$

where $Q$, $K$ and $V$ are of dimension $L_Q \times d_k$, $L_K \times d_k$ and $L_K \times d_v$, respectively; $L_Q$ and $L_K$ are the sequence lengths, and $d_k$ and $d_v$ are the dimensions of each element in the sequence. The temporal attention takes place in the matrix multiplication of the query and the key; the subsequent multiplication with the value attends the features of each input element to the features of the other input and output elements. To scale this attention model into a larger and more parallelizable one, the same inputs go through multiple attention units. Their outputs are then concatenated and densely connected. This concatenation of multiple scaled dot-product attention units is called multi-head attention, with each head denoting one basic attention unit.
Figure 2.21: Multi-head attention diagram.
$$\mathrm{MultiHead}(q,k,v) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O} \qquad (2.23)$$
$$\mathrm{head}_i = \mathrm{Attention}\big(qW^{Q}_i,\, kW^{K}_i,\, vW^{V}_i\big) \qquad (2.24)$$

where $q \in \mathbb{R}^{L_Q \times d_{model}}$, $k \in \mathbb{R}^{L_K \times d_{model}}$ and $v \in \mathbb{R}^{L_K \times d_{model}}$, with $d_{model}$ denoting the Transformer's internal element dimension. $W^{Q}_i \in \mathbb{R}^{d_{model} \times d_k}$, $W^{K}_i \in \mathbb{R}^{d_{model} \times d_k}$, $W^{V}_i \in \mathbb{R}^{d_{model} \times d_v}$ and $W^{O} \in \mathbb{R}^{h d_v \times d_{model}}$ are trainable dense connections (weights). In the Transformer, the input sequence $X \in \mathbb{R}^{L_X \times d_{model}}$ can attend to itself, which is called self-attention. The $q$, $k$ and $v$ of self-attention come from the same input sequence; specifically, $q = k = v = X$, so $L_Q = L_K = L_X$. Attention can also take place between two sequences $X$ and $Y \in \mathbb{R}^{L_Y \times d_{model}}$, in which case $q = Y$, $k = v = X$, and $L_Q = L_Y$, $L_K = L_X$. This is called encoder-decoder attention.
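The following is a minimal numpy sketch of Eqs. (2.22)-(2.24). The dimensions and random weights are hypothetical, and layer normalization, masking and dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (2.22)."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(q, k, v, Wq, Wk, Wv, Wo):
    """Multi-head attention, Eqs. (2.23)-(2.24); one (Wq, Wk, Wv) triple per head."""
    heads = [attention(q @ Wq[i], k @ Wk[i], v @ Wv[i]) for i in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ Wo

# Hypothetical sizes: L_X = 6 tokens, d_model = 16, h = 4 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
L_X, d_model, h, d_k = 6, 16, 4, 4
X = rng.normal(size=(L_X, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(h, d_model, d_k)) for _ in range(3))
Wo = rng.normal(scale=0.1, size=(h * d_k, d_model))

# Self-attention: q = k = v = X.
out = multi_head(X, X, X, Wq, Wk, Wv, Wo)
print(out.shape)  # (6, 16)
```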
The detailed design of the Transformer is shown in Fig. 2.22. It can be seen that, similar to other NMT solutions, the Transformer is also encoder-decoder based. The encoder encodes the input sequence; the decoder takes in the encoder output and the decoder output at the previous time step, and generates the output sequence one element at a time.
Unlike an RNN or a CNN, which processes one element or one part of the sequence at a time, the Transformer processes the entire input sequence at once, and its mostly matrix-multiplication operations do not explicitly model positional information. The positional encoding, as its name indicates, adds such information for each element in the sequence. The input sequences to the encoder and the decoder are first positionally encoded as follows:
Figure 2.22: Transformer model diagram.
$$PE_{2i} = \sin\!\big(pos/10000^{2i/d_{model}}\big) \qquad (2.25)$$
$$PE_{2i+1} = \cos\!\big(pos/10000^{2i/d_{model}}\big) \qquad (2.26)$$

where $pos$ is the position and $i$ is the dimension of the feature. It can be seen that the positional encoding generates a periodic, sinusoidal value for each position and dimension; it is argued that such treatment can help the Transformer learn to attend by relative positions. The positional encoding generates a tensor of positional information with the same dimension as the embedded input sequence, which is then added to the embedded input sequence.
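A short numpy sketch of Eqs. (2.25)-(2.26) follows; the sequence length and model dimension are arbitrary illustrative values, and an even $d_{model}$ is assumed.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding per Eqs. (2.25)-(2.26); d_model assumed even."""
    pos = np.arange(seq_len)[:, None]                # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]             # feature-pair index
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions
    return pe

# The encoding is simply added to the embedded input sequence.
seq_len, d_model = 6, 16
embedded = np.random.default_rng(0).normal(size=(seq_len, d_model))
encoder_input = embedded + positional_encoding(seq_len, d_model)
print(encoder_input.shape)  # (6, 16)
```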
The input sequence is then self-attended and passed through a dense feed-forward layer in each encoder layer. Each attention and feed-forward operation comes with a residual connection [HZRS16a] to ensure training efficiency.
The decoder follows a similar design with an added encoder-decoder attention in between. The self-attention at the decoder is slightly different from its encoder counterpart, since the decoder's input sequence needs to be updated one element at a time until the entire output sequence has been generated. In this process, to ensure dimension consistency, the empty positions of the sequence need to be masked with a special tensor so that these positions are neither attended to nor back-propagated through during training.
Figure 2.23: Masked scaled dot-product attention diagram.
The Transformer has outperformed existing RNN- and CNN-based NMT models and is now the dominant NMT model in use. Another advantage of the Transformer is its parallelization-friendly multi-head design, which usually takes a fraction of the time to train compared to other NMT models.
Chapter 3
Sequence Analysis via Recurrent
Neural Networks
3.1 Motivation
As discussed in Sec. 2.1.1, LSTM and GRU cells were designed to enhance the memory
length of RNNs and address the gradient vanishing/exploding issue [HS97, RMB13,
BSF94], yet thorough analysis on their memory decay property is lacking. The first
objective of this research is to analyze the memory length of three RNN cells - simple
RNN (SRN) [Elm90, Jor97], LSTM and GRU. It will be conducted in Sec. 3.2. Our
analysis is different from the investigation of gradient vanishing/exploding problem in
the following sense. The gradient vanishing/exploding problem occurs in the training
process while memory analysis is conducted on a trained RNN model. Based on the
analysis, we further propose a new design in Sec. 3.3 to extend the memory length of a
cell, and call it the extended long short-term memory (ELSTM).
As to the macro RNN model, one popular choice is the BRNN [SP97]. Another
choice is the encoder-decoder system, where the attention mechanism was introduced
to improve its performance in [VKK+15, BCB15]. We show that the encoder-decoder
system is not an efficient learner by itself. A better solution is to exploit the encoder-
decoder and the BRNN jointly so as to overcome their individual limitations. Following
this line of thought, we propose a new multi-task model, called the dependent bidirec-
tional recurrent neural network (DBRNN), in Sec. 3.4.
To demonstrate the performance of the DBRNN model with the ELSTM cell, we
conduct a series of experiments on the language modeling (LM), the part of speech
(POS) tagging and the dependency parsing (DP) problems in Sec. 3.5. Finally, concluding remarks are given and future research directions are pointed out in Sec. 3.6.
3.2 Memory Analysis of SRN, LSTM and GRU
For a large number of NLP tasks, we are concerned with finding semantic patterns from
input sequences. It was shown by Elman [Elm90] that an RNN builds an internal rep-
resentation of semantic patterns. The memory of a cell characterizes its ability to map
input sequences of certain length into such a representation. Here, we define the memory
as a function that maps elements of the input sequence to the current output. Thus, the
memory of an RNN is not only about whether an element can be mapped into the current
output but also how this mapping takes place. It was reported by Gers et al. [GSC00]
that an SRN only memorizes sequences of length between 3-5 units while an LSTM
could memorize sequences of length longer than 1000 units. In this section, we conduct
memory analysis on SRN, LSTM and GRU cells.
3.2.1 Memory of SRN
For ease of analysis, we begin with Elman’s SRN model [Elm90] with a linear hidden-
state activation function and a non-linear output activation function since such a cell
model is mathematically tractable while its performance is equivalent to Jordan’s model
[Jor97].
The SRN model can be described by the following two equations:
$$c_t = W_c\, c_{t-1} + W_{in} X_t, \qquad (3.1)$$
$$h_t = f(c_t), \qquad (3.2)$$
where subscript $t$ is the time-unit index, $W_c \in \mathbb{R}^{N\times N}$ is the weight matrix for the hidden-state vector $c_{t-1}\in\mathbb{R}^{N}$, $W_{in}\in\mathbb{R}^{N\times M}$ is the weight matrix of the input vector $X_t\in\mathbb{R}^{M}$, $h_t\in\mathbb{R}^{N}$ is the output vector, and $f(\cdot)$ is an element-wise non-linear activation function. Usually, $f(\cdot)$ is a hyperbolic-tangent or a sigmoid function. We omit the bias terms by including them in the corresponding weight matrices. The multiplication between two equal-sized vectors in this paper is element-wise multiplication.
By induction, $c_t$ can be written as
$$c_t = W_c^{t}\, c_0 + \sum_{k=1}^{t} W_c^{t-k} W_{in} X_k, \qquad (3.3)$$
where $c_0$ is the initial internal state of the SRN. Typically, we set $c_0 = 0$. Then, Eq. (3.3) becomes
$$c_t = \sum_{k=1}^{t} W_c^{t-k} W_{in} X_k. \qquad (3.4)$$
As shown in Eq. (3.4), SRN’s output is a function of all proceeding elements in the
input sequence. The dependency between the output and the input allows the SRN to
retain the semantic sequential patterns from the input. For the rest of this paper, we call
a system whose function introduces dependency between the output and its proceeding
elements in the input as a system with memory.
Although the SRN is a system with memory, its memory length is limited. Let $\sigma_{max}$ be the largest singular value of $W_c$. Then, we have
$$|W_c^{t-k} W_{in} X_k| \;\le\; \|W_c\|^{t-k}\, |W_{in} X_k| \;=\; \sigma_{max}(W_c)^{t-k}\, |W_{in} X_k|, \quad k \le t, \qquad (3.5)$$
where $\|\cdot\|$ denotes the matrix norm and $|\cdot|$ denotes the vector norm, both being $l_2$ norms, and $\sigma_{max}(\cdot)$ denotes the largest singular value of its argument. The inequality follows from the definition of the matrix norm. The equality follows from the fact that the spectral norm ($l_2$ norm of a matrix) of a square matrix is equal to its largest singular value.
Here, we are only interested in the case of memory decay, when $\sigma_{max}(W_c) < 1$. Since the contribution of $X_k$, $k < t$, to the output $h_t$ decays at least in the form of $\sigma_{max}(W_c)^{t-k}$, we conclude that the SRN's memory decays at least exponentially with its memory length $t-k$.
3.2.2 Memory of LSTM
In Fig. 2.8, , and b denote the hyperbolic tangent function, the sigmoid function
(to be differed from the singular value operations denote as
max
or
min
with subscript)
and the multiplication operation, respectively. All of them operate in an element-wise
fashion. The LSTM cell has an input gate, an output gate, a forget gate and a constant
error carousal (CEC) module. Mathematically, the LSTM cell can be written as
c
t
pW
f
I
t
qc
t 1
pW
i
I
t
qpW
in
I
t
q; (3.6)
h
t
pW
o
I
t
qpc
t
q; (3.7)
where c
t
P R
N
, column vector I
t
P R
pM N q
is a concatenation of the current input,
X
t
P R
M
, and the previous output, h
t 1
P R
N
(i.e., I
T
t
r X
T
t
;h
T
t 1
s). Furthermore,
43
W
f
,W
i
,W
o
andW
in
are weight matrices for the forget gate, the input gate, the output
gate and the input, respectively.
Under the assumption $c_0 = 0$, the hidden-state vector of the LSTM can be derived by induction as
$$c_t = \sum_{k=1}^{t}\;\underbrace{\prod_{j=k+1}^{t}\sigma(W_f I_j)}_{\text{forget gate}}\;\otimes\,\sigma(W_i I_k)\otimes\phi(W_{in} I_k). \qquad (3.8)$$
By setting $f(\cdot)$ in Eq. (3.2) to the hyperbolic-tangent function, we can compare the outputs of the SRN and the LSTM below:
$$h_t^{SRN} = \phi\!\left(\sum_{k=1}^{t} W_c^{t-k} W_{in} X_k\right), \qquad (3.9)$$
$$h_t^{LSTM} = \sigma(W_o I_t)\otimes\phi\!\left(\sum_{k=1}^{t}\;\underbrace{\prod_{j=k+1}^{t}\sigma(W_f I_j)}_{\text{forget gate}}\;\otimes\,\sigma(W_i I_k)\otimes\phi(W_{in} I_k)\right). \qquad (3.10)$$
We see from the above that $W_c^{t-k}$ and $\prod_{j=k+1}^{t}\sigma(W_f I_j)$ play the same memory role for the SRN and the LSTM, respectively.
We can find many special cases where the LSTM's memory length exceeds the SRN's regardless of the choice of the SRN's model parameters ($W_c$, $W_{in}$). For example,
$$\exists\, W_f \;\;\text{s.t.}\;\; \min_j \big|\sigma(W_f I_j)\big| \ge \sigma_{max}(W_c), \;\; \forall\, \sigma_{max}(W_c)\in[0,1),$$
then
$$\Big|\prod_{j=k+1}^{t}\sigma(W_f I_j)\Big| \ge \sigma_{max}(W_c)^{t-k}, \quad t \ge k. \qquad (3.11)$$
As given in Eqs. (3.5) and (3.11), the impact of the input $I_k$ on the output of the LSTM lasts longer than that of the SRN. This means there always exists an LSTM whose memory length is longer than that of the SRN for all possible choices of the SRN.
Conversely, to find an SRN with a similar advantage over the LSTM, we need to ensure $\|W_c^{t-k}\| \ge 1 \ge \big|\prod_{j=k+1}^{t}\sigma(W_f I_j)\big|$. Although such a $W_c$ exists, this condition easily leads to memory explosion. For example, one close lower bound for $\|W_c^{t-k}\|$ is $\sigma_{min}(W_c)^{t-k}$, where $\sigma_{min}(W_c)$ is the smallest singular value of $W_c$ (this follows from the facts that $\|AB\| \ge \sigma_{min}(A)\|B\|$ and $\|B\| = \sigma_{max}(B) \ge \sigma_{min}(B)$; use induction for the derivation). We need $\sigma_{min}(W_c) \ge 1$, and since $\|W_c^{t-k}\| \ge \sigma_{min}(W_c)^{t-k}$, the SRN's memory will grow exponentially and end up in memory explosion. Such a memory-explosion constraint does not exist in the LSTM.
3.2.3 Memory of GRU
The GRU was originally proposed for neural machine translation [CMG
14]. It pro-
vides an effective alternative for the LSTM. Its operations can be expressed by the fol-
lowing four equations:
z
t
pW
z
X
t
U
z
h
t 1
q; (3.12)
r
t
pW
r
X
t
U
r
h
t 1
q; (3.13)
~
h
t
pWX
t
U pr
t
bh
t 1
qq; (3.14)
h
t
z
t
h
t 1
p 1 z
t
q
~
h
t
; (3.15)
whereX
t
,h
t
,z
t
andr
t
denote the input, the hidden-state, the update gate and the reset
gate vectors, respectively, and W
z
, W
r
, W , are trainable weight matrices. Its hidden-
state is also its output, which is given in Eq. (3.15). Its diagram is shown in Fig. 2.9,
where Concat denotes the vector concatenation operation.
By setting $U_z$, $U_r$ and $U$ to zero matrices, we obtain the following simplified GRU system:
$$z_t = \sigma(W_z X_t), \qquad (3.16)$$
$$\tilde{h}_t = \phi(W X_t), \qquad (3.17)$$
$$h_t = z_t\otimes h_{t-1} + (1 - z_t)\otimes\tilde{h}_t. \qquad (3.18)$$
For the simplified GRU with the initial rest condition, we can derive the following by induction:
$$h_t = \sum_{k=1}^{t}\;\underbrace{\prod_{j=k+1}^{t}\sigma(W_z X_j)}_{\text{update gate}}\;\otimes\,\big(1 - \sigma(W_z X_k)\big)\otimes\phi(W X_k). \qquad (3.19)$$
By comparing Eqs. (3.8) and (3.19), we see that the update gate of the simplified GRU and the forget gate of the LSTM play the same role. In other words, there is no fundamental difference between the GRU and the LSTM. This finding is substantiated by the non-conclusive performance comparisons between the GRU and the LSTM conducted in [GSK+17, CGCB14, KJL15].
Because of the multiplicative terms introduced by the forget gate and the update gate in Eqs. (3.8) and (3.19), the longer the distance $t-k$, the smaller these terms. Thus, the memory responses of the LSTM and the GRU to $I_k$ diminish inevitably as $t-k$ becomes larger. This phenomenon occurs regardless of the choice of model parameters. For complex language tasks that require long memory responses, such as sentence parsing, the LSTM's and GRU's memory decay may have a significant impact on their performance.
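The decay behavior can be illustrated numerically with the short numpy sketch below; it evaluates the SRN bound of Eq. (3.5) and an LSTM/GRU-style product of gate activations for increasing distance $t-k$. The singular value and gate values are arbitrary placeholders, not quantities from a trained model.

```python
import numpy as np

# Illustration of memory decay vs. distance t - k (toy values, not a trained model).
rng = np.random.default_rng(0)
sigma_max = 0.9                            # largest singular value of W_c (assumed < 1)
gate = rng.uniform(0.7, 0.99, size=100)    # stand-ins for sigmoid gate activations

for dist in (1, 5, 20, 60):
    srn_term = sigma_max ** dist           # SRN decay bound, Eq. (3.5)
    gated_term = np.prod(gate[:dist])      # LSTM forget-gate / GRU update-gate product
    print(f"t-k={dist:3d}  SRN<= {srn_term:.2e}  gate-product={gated_term:.2e}")
```

Both terms shrink as the distance grows, which is the memory decay discussed above; the gate product can decay more slowly than the SRN bound, but it still vanishes for large distances.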
3.3 Extended Long Short-Term Memory (ELSTM)
To address this design limitation, we introduce a scaling factor to compensate for the fast decay of the input response. This leads to a new solution called the extended LSTM (ELSTM). The ELSTM cell is depicted in Fig. 3.1, where $s_i\in\mathbb{R}^{N}$, $i = 1,\cdots,t-1$, are the trainable input scaling vectors.
Figure 3.1: The diagrams of the ELSTM cell.
The ELSTM cell can be described by
$$c_t = \sigma(W_f I_t)\otimes c_{t-1} + s_t\otimes\sigma(W_i I_t)\otimes\phi(W_{in} I_t), \qquad (3.20)$$
$$h_t = \sigma(W_o I_t)\otimes\phi(c_t). \qquad (3.21)$$
A bias term $b\in\mathbb{R}^{N}$ for $c_t$ is omitted in Eq. (3.21). As shown above, we introduce the scaling factors $s_i$, $i = 1,\cdots,t-1$, into the ELSTM to increase or decrease the impact of the input $I_i$ in the sequence.
To show that the ELSTM has longer memory than the LSTM, we first derive a closed-form expression of $h_t$ as
$$h_t = \sigma(W_o I_t)\otimes\phi\!\left(\sum_{k=1}^{t} s_k\otimes\prod_{j=k+1}^{t}\sigma(W_f I_j)\;\otimes\,\sigma(W_i I_k)\otimes\phi(W_{in} I_k)\right). \qquad (3.22)$$
Then, we can find the following special case:
$$\exists\, s_k \;\;\text{s.t.}\;\; \Big|s_k\otimes\prod_{j=k+1}^{t}\sigma(W_f I_j)\Big| \ge \Big|\prod_{j=k+1}^{t}\sigma(W_f I_j)\Big|, \quad \forall\, W_f. \qquad (3.23)$$
By comparing Eq. (3.23) with Eq. (3.11), we conclude that there always exists an ELSTM whose memory is longer than the LSTM's for all choices of the LSTM. Conversely, we cannot find an LSTM with a similar advantage over the ELSTM. This demonstrates the ELSTM's design advantage over the LSTM.
The numbers of parameters used by the various RNN cells are compared in Table 3.1, where $X_t\in\mathbb{R}^{M}$, $h_t\in\mathbb{R}^{N}$ and $t = 1,\cdots,T$. As shown in Table 3.1, the number of parameters of the ELSTM cell depends on the maximum length $T$ of the input sequences, which makes the model size uncontrollable. To address this problem, we choose a fixed $T_s$ (with $T_s < T$) as the upper bound on the number of scaling factors, and set $s_k = s_{(k-1)\bmod T_s + 1}$ if $k > T_s$, where $k$ starts from 1 and mod denotes the modulo operator. In other words, the sequence of scaling factors is a periodic one with period $T_s$, so elements in a sequence that are separated by a distance of $T_s$ will share the same scaling factor.
Table 3.1: Comparison of parameter numbers.
Cell     Number of parameters
LSTM     $4N(M+N+1)$
GRU      $3N(M+N+1)$
ELSTM    $4N(M+N+1) + N(T+1)$
The ELSTM cell with periodic scaling factors can be described by
$$c_t = \sigma(W_f I_t)\otimes c_{t-1} + s_{t_s}\otimes\sigma(W_i I_t)\otimes\phi(W_{in} I_t), \qquad (3.24)$$
$$h_t = \sigma(W_o I_t)\otimes\phi(c_t), \qquad (3.25)$$
where $t_s = (t-1)\bmod T_s + 1$. We observe that the choice of $T_s$ affects the network performance. Generally speaking, a small $T_s$ value is suitable for simple language tasks that demand shorter memory, while a larger $T_s$ value is desired for complex ones that demand longer memory. For the particular sequence-to-sequence (seq2seq [SVL14, VKK+15]) RNN models, a larger $T_s$ value is always preferred. We will elaborate on the parameter settings in Sec. 3.5.
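Below is a minimal numpy sketch of an ELSTM cell following the reconstructed Eqs. (3.24)-(3.25) with periodic scaling factors. The gate weights, dimensions and the toy input sequence are hypothetical placeholders; the scaling factors are initialized to 1, as noted in Sec. 3.3.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ELSTMCell:
    """Sketch of the ELSTM cell, Eqs. (3.24)-(3.25), with periodic scaling factors."""
    def __init__(self, M, N, Ts, rng):
        self.N, self.Ts = N, Ts
        # W_f, W_i, W_o, W_in act on I_t = [X_t; h_{t-1}] of size M + N.
        self.Wf, self.Wi, self.Wo, self.Win = (
            rng.normal(scale=0.1, size=(N, M + N)) for _ in range(4))
        self.s = np.ones((Ts, N))              # scaling factors, initialized to 1

    def step(self, x_t, h_prev, c_prev, t):
        I_t = np.concatenate([x_t, h_prev])
        ts = (t - 1) % self.Ts                 # periodic index t_s = (t-1) mod T_s + 1
        c_t = sigmoid(self.Wf @ I_t) * c_prev + \
              self.s[ts] * sigmoid(self.Wi @ I_t) * np.tanh(self.Win @ I_t)
        h_t = sigmoid(self.Wo @ I_t) * np.tanh(c_t)
        return h_t, c_t

rng = np.random.default_rng(0)
cell = ELSTMCell(M=8, N=16, Ts=4, rng=rng)
h, c = np.zeros(16), np.zeros(16)
for t, x in enumerate(rng.normal(size=(10, 8)), start=1):   # toy sequence of length 10
    h, c = cell.step(x, h, c, t)
print(h.shape)  # (16,)
```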
3.3.1 Study of Scaling Factor
To examine the memory capability of the scaling factor, we carry out the following experiment. The RNN cell is asked to tell whether a special element "A" exists in a sequence of length $T$ consisting of a single "A" and multiple "B"s. The training data contains $T$ positive samples, in which "A" is located at positions 1 to $T$, and one negative sample in which no "A" exists. The cell takes in the whole sequence and generates its output at time step $T$, as shown in Fig. 3.2.
Figure 3.2: Experiment of estimating the presence of “A”.
We would like to examine the memory responses of the LSTM and the ELSTM to "A". If "A" lies at the beginning of the sequence, the LSTM's memory decay may cause it to lose the information of "A"'s presence. The memory responses of the LSTM and the ELSTM to the input $I_k$ are calculated as
$$mr_k^{LSTM} = \prod_{j=k+1}^{T}\sigma(W_f I_j)\;\otimes\,\sigma(W_i I_k)\otimes\phi(W_{in} I_k), \qquad (3.26)$$
$$mr_k^{ELSTM} = s_k\otimes\prod_{j=k+1}^{T}\sigma(W_f I_j)\;\otimes\,\sigma(W_i I_k)\otimes\phi(W_{in} I_k). \qquad (3.27)$$
The detailed model settings can be found in Table 3.2.
Table 3.2: Network parameters for the toy experiment.
Number of RNN layers 1
Embedding layer vector size 2
Number of RNN cells 1
Batch size 5
We carry out multiple such experiments by increasing the sample length $T$ by 1 at a time and observing when the LSTM can no longer keep up with the ELSTM. We train the LSTM and ELSTM models with an equal number of epochs until both report no further change in training loss.
We found that when $T = 60$, the LSTM's training loss starts to plateau while the ELSTM's can further decrease to zero. As a result, the LSTM starts to "forget" when $T > 60$. Detailed plots of the memory responses for two particular samples are shown in Fig. 3.3.
Figure 3.3: Comparison of memory responses between LSTM and ELSTM.

Fig. 3.3a shows the memory responses of the trained LSTM and ELSTM on a sample with $T = 10$ and "A" at position 9. It can be seen that, although both the LSTM and the ELSTM have a stronger memory response at "A", the ELSTM attends better than the LSTM since its response at position 10 is smaller than the LSTM's. We can also see that the scaling factor has a larger value at the beginning, slowly decreases as the position approaches the end, and then spikes at position 9. We can interpret this as the scaling factor doing its compensating job at both ends of the sequence.

Fig. 3.3b shows the memory responses of the trained LSTM and ELSTM on a sample with $T = 60$ and "A" at position 30. In this case, the LSTM is not able to "remember" the presence of "A" and does not have a strong response to it. The scaling factor does its compensating job over the first half of the sequence, and especially in the middle, which causes the ELSTM's strong response to "A".

Even though the scaling factor cannot adaptively change its value once it is trained, it is able to learn the pattern of the model's rate of memory decay and the averaged importance of each position in the training set.
It is important to point out that the scaling factor needs to be initialized to 1 for each
cell.
3.4 Dependent BRNN (DBRNN) Model
A single RNN cell is rarely used in practice due to its limited capability in modeling real-world problems. To build a more powerful RNN model, it is common to integrate several cells with different probabilistic models. To give an example, the sequence-to-sequence problem demands an RNN model that predicts an output sequence $\{Y_t\}_{t=1}^{T'}$ with $Y_i\in\mathbb{R}^{N}$, based on an input sequence $\{X_t\}_{t=1}^{T}$ with $X_i\in\mathbb{R}^{M}$, where $T$ and $T'$ are the lengths of the input and output sequences, respectively. To solve this problem, we propose a macro RNN model, called the dependent BRNN (DBRNN), in this section. Our design is inspired by the pros and cons of two RNN models, namely the bidirectional RNN (BRNN) [SP97] and the encoder-decoder design [CMG+14]. We review the BRNN and the encoder-decoder in Sec. 3.4.1. Then, the DBRNN is proposed in Sec. 3.4.2.
3.4.1 BRNN and Encoder-Decoder
As its name indicates, the BRNN takes inputs in both the forward and backward directions, as shown in Fig. 2.10. It has two RNN cells: one takes the input in the forward direction, the other in the backward direction.
The motivation for the BRNN is to fully utilize the input sequence when future information ($\{X_i\}_{i=t+1}^{T}$) is accessible. This is especially helpful if the current output $Y_t$ is also a function of future inputs. The conditional probability density function of the BRNN is of the form
$$p_t = P(Y_t \,|\, \{X_i\}_{i=1}^{T}) = W^{f} p_t^{f} + W^{b} p_t^{b}, \qquad (3.28)$$
$$\hat{Y}_t = \arg\max_{Y_t}\, p_t, \qquad (3.29)$$
where
$$p_t^{f} = P(Y_t \,|\, \{X_i\}_{i=1}^{t}), \qquad (3.30)$$
$$p_t^{b} = P(Y_t \,|\, \{X_i\}_{i=t}^{T}), \qquad (3.31)$$
$W^{f}$ and $W^{b}$ are trainable weights, and $\hat{Y}_t$ is the predicted output element at time step $t$. The output is thus a combination of the density estimate of a forward RNN and that of a backward RNN. Due to the bidirectional design, the BRNN can utilize the information of the entire input sequence to predict each individual output element. One example where such treatment is helpful is generating a sentence like "this is an apple" for language modeling (predicting the next word given the preceding words in a sentence). In this case, the word "an" strongly associates with the following word "apple"; a forward-directional RNN model would have difficulty generating "an" before "apple".
The encoder-decoder was first proposed for machine translation (MT) along with the GRU in [CMG+14]. It was motivated by the need to handle the situation when $T' \neq T$. It has two RNN cells: an encoder and a decoder. The detailed design of one of the early proposals [SVL14] of the encoder-decoder RNN model is illustrated in Fig. 2.11.
As can be seen in Fig. 2.11, the encoder (denoted by Enc) takes the input sequence of length $T$ and generates its output $h_i^{Enc}$ and hidden state $c_i^{Enc}$, where $i\in\{1,\ldots,T\}$. In the seq2seq model, the encoder's hidden state at time step $T$ is used as the representation of the input sequence. The decoder then utilizes this hidden-state information to generate the output sequence of length $T'$ by initializing its hidden state $c_1^{Dec}$ as $c_T^{Enc}$. Thus, the decoding process starts after the encoder has processed the entire input sequence. In practice, the input to the decoder at time step 1 is a pre-defined start-decoding symbol. At the following time steps, the previous output $Y_{t-1}$ is used as the input. The decoder stops the decoding process when a special pre-defined stopping symbol is generated.
As compared with the BRNN, the encoder-decoder is advantageous not only in its ability to handle input/output sequences of different lengths, but also in generating better-aligned output sequences by explicitly feeding previously predicted outputs back to its decoder. Thus, the encoder-decoder estimates the following density function:
$$p_t = P(Y_t \,|\, \{\hat{Y}_i\}_{i=1}^{t-1}, \{X_i\}_{i=1}^{T}), \qquad (3.32)$$
$$\hat{Y}_t = \arg\max_{Y_t}\, p_t, \quad \forall t\in\{1,\ldots,T'\}. \qquad (3.33)$$
One example where such treatment is helpful is the translation of the Chinese sentence "你来自哪里" into the English sentence "where are you from", where "你" corresponds to "you", "来自" corresponds to "from", "哪里" corresponds to "where", and the word "are" has no corresponding Chinese alignment. The word "are" is thus more pertinent to the words "where" and "you" in the translated English sentence than to the source sentence in Chinese. Since the decoder of seq2seq has no bidirectional design, "are" cannot be conditioned on "you"; nevertheless, the previously generated "where" gives a strong hint as to which word should be generated next.
For better encoder-decoder alignment, various attention mechanisms have been proposed for the encoder-decoder model. In [VKK+15, BCB15], additional weighted connections are introduced to connect the decoder to the hidden states of the encoder.
On the other hand, the encoder-decoder system is vulnerable to previous erroneous predictions in the forward path. Recently, the BRNN was introduced into the encoder by Bahdanau et al. [BCB15], yet their design does not address the erroneous prediction problem.
3.4.2 DBRNN Model and Training
Being motivated by the observations in Sec. 3.4.1, we propose a multi-task BRNN model, called the dependent BRNN (DBRNN), to achieve the following objectives:
$$p_t = W^{f} p_t^{f} + W^{b} p_t^{b}, \qquad (3.34)$$
$$\hat{Y}_t^{f} = \arg\max_{Y_t}\, p_t^{f}, \qquad (3.35)$$
$$\hat{Y}_t^{b} = \arg\max_{Y_t}\, p_t^{b}, \qquad (3.36)$$
$$\hat{Y}_t = \arg\max_{Y_t}\, p_t, \qquad (3.37)$$
where
$$p_t^{f} = P(Y_t \,|\, \{X_i\}_{i=1}^{T}, \{\hat{Y}_i^{f}\}_{i=1}^{t-1}), \qquad (3.38)$$
$$p_t^{b} = P(Y_t \,|\, \{X_i\}_{i=1}^{T}, \{\hat{Y}_i^{b}\}_{i=t+1}^{T'}), \qquad (3.39)$$
$$p_t = P(Y_t \,|\, \{X_i\}_{i=1}^{T}), \qquad (3.40)$$
and $W^{f}$ and $W^{b}$ are trainable weights. As shown in Eqs. (3.35), (3.36) and (3.37), the DBRNN has three learning objectives: 1) the target sequence for the forward RNN prediction, 2) the reversed target sequence for the backward RNN prediction, and 3) the target sequence for the bidirectional prediction.
The DBRNN model is shown in Fig. 3.4. It consists of a lower and an upper BRNN branch. At each time step, the input to the forward and backward parts of the upper BRNN is the concatenation of the forward and backward outputs from the lower BRNN branch. The final bidirectional prediction is the pooling of the forward and backward predictions. We will show later that this design makes the DBRNN robust to previous erroneous predictions.
Figure 3.4: The DBRNN model.
Let $F(\cdot)$ be the cell function. The input is fed into the forward and backward RNNs of the lower BRNN branch as
$$h_t^{f} = F_l^{f}\big(x_t, c_{l(t-1)}^{f}\big), \quad h_t^{b} = F_l^{b}\big(x_t, c_{l(t-1)}^{b}\big), \quad h_t = \big[h_t^{f},\, h_t^{b}\big], \qquad (3.41)$$
where $c$ and $l$ denote the cell hidden state and the lower BRNN, respectively. The final output $h_t$ of the lower BRNN is the concatenation of the output $h_t^{f}$ of the forward RNN and the output $h_t^{b}$ of the backward RNN. Similarly, the upper BRNN generates the final output $p_t$ as
$$p_t^{f} = F_u^{f}\big(h_t, c_{u(t-1)}^{f}\big), \quad p_t^{b} = F_u^{b}\big(h_t, c_{u(t-1)}^{b}\big), \quad p_t = W^{f} p_t^{f} + W^{b} p_t^{b}, \qquad (3.42)$$
where $u$ denotes the upper BRNN. To generate the forward prediction $\hat{Y}_t^{f}$ and the backward prediction $\hat{Y}_t^{b}$, the forward and backward paths of the upper BRNN branch are separately trained by the original and the reversed target sequences, respectively. The results of the forward and backward predictions of the upper BRNN branch are then combined to generate the final result.
There are three errors: 1) the forward prediction error $e^{f}$ for $\hat{Y}_t^{f}$, 2) the backward prediction error $e^{b}$ for $\hat{Y}_t^{b}$, and 3) the bidirectional prediction error $e$ for $\hat{Y}_t$. To train the proposed DBRNN, $e^{f}$ is backpropagated through time to the upper forward RNN and the lower BRNN, $e^{b}$ is backpropagated through time to the upper backward RNN and the lower BRNN, and $e$ is backpropagated through time to the entire model.
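A small numpy sketch of the prediction-combination step of Eq. (3.42) and the argmax predictions of Eqs. (3.35)-(3.37) is given below. The forward/backward distributions and the combination weights are random placeholders rather than outputs of a trained DBRNN, and diagonal matrices are used only to keep per-class weights simple.

```python
import numpy as np

# Sketch of the DBRNN output combination p_t = W^f p^f_t + W^b p^b_t (Eq. (3.42)),
# followed by the three argmax predictions of Eqs. (3.35)-(3.37).
# Distributions and weights below are random placeholders, not a trained model.
rng = np.random.default_rng(0)
K = 10                                   # number of output classes
p_f = rng.dirichlet(np.ones(K))          # forward-branch distribution p^f_t
p_b = rng.dirichlet(np.ones(K))          # backward-branch distribution p^b_t
W_f = np.diag(rng.uniform(0.4, 0.6, K))  # trainable combination weights (placeholders)
W_b = np.diag(rng.uniform(0.4, 0.6, K))

p = W_f @ p_f + W_b @ p_b                # bidirectional prediction
y_hat_f, y_hat_b, y_hat = p_f.argmax(), p_b.argmax(), p.argmax()
print(y_hat_f, y_hat_b, y_hat)
```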
As can be seen, the DBRNN, like an encoder-decoder, can better handle output alignment. By introducing the bidirectional design into its decoder, the DBRNN also handles previous erroneous predictions better than the encoder-decoder. To show that the DBRNN is more robust to previous erroneous predictions than one-directional models, we compare their cross entropies, defined as
$$l = -\sum_{k=1}^{K} p_{t,k}\,\log(\hat{p}_{t,k}), \qquad (3.43)$$
where $K$ is the total number of classes (e.g., the size of the vocabulary for a language task), $\hat{p}_t$ is the predicted distribution, and $p_t$ is the ground-truth distribution with $k'$ as the ground-truth label. It is in the form of a one-hot vector. That is,
$$p_t = (\delta_{1,k'}, \cdots, \delta_{k',k'}, \cdots, \delta_{K,k'})^{T}, \quad k = 1,\cdots,K,$$
where $\delta_{k,k'}$ is the Kronecker delta function. Based on Eq. (3.34), $l$ can be further expressed as
$$l = -\sum_{k=1}^{K} p_{t,k}\,\log\big(W_k^{f}\hat{p}_{t,k}^{f} + W_k^{b}\hat{p}_{t,k}^{b}\big) \qquad (3.44)$$
$$\;\; = -\log\big(W_{k'}^{f}\hat{p}_{t,k'}^{f} + W_{k'}^{b}\hat{p}_{t,k'}^{b}\big). \qquad (3.45)$$
We can select $W_{k'}^{f}$ and $W_{k'}^{b}$ such that $W_{k'}^{f}\hat{p}_{t,k'}^{f} + W_{k'}^{b}\hat{p}_{t,k'}^{b}$ is greater than $\hat{p}_{t,k'}^{f}$ and $\hat{p}_{t,k'}^{b}$. Then, we obtain
$$l < -\sum_{k=1}^{K} p_{t,k}\,\log(\hat{p}_{t,k}^{f}), \qquad (3.46)$$
$$l < -\sum_{k=1}^{K} p_{t,k}\,\log(\hat{p}_{t,k}^{b}). \qquad (3.47)$$
The above two inequalities indicate that there always exists a DBRNN with better performance than the encoder-decoder, regardless of which parameters the encoder-decoder chooses. Thus, the DBRNN does not share the encoder-decoder's model limitations.
It is worthwhile to compare the proposed DBRNN with the bi-attention model of Cheng et al. [CFH+16]. Both of them have bidirectional predictions for the output, yet there are three main differences. First, the DBRNN provides a generic solution to the SISO problem without being restricted to dependency parsing. The target sequences in training (namely, those for $\hat{Y}_t^{f}$, $\hat{Y}_t^{b}$ and $\hat{Y}_t$) are the same for the DBRNN, while the solution in [CFH+16] has different target sequences. Second, the attention mechanism is used in [CFH+16] but not in the DBRNN.
3.5 Experiments
3.5.1 Experimental Setup
In the experiments, we compare the performance of five RNN macro-models:
1. basic one-directional RNN (basic RNN);
2. bidirectional RNN (BRNN);
3. sequence-to-sequence (seq2seq) RNN [SVL14] (a variant of the encoder-
decoder);
4. seq2seq with attention [VKK+15];
5. dependent bidirectional RNN (DBRNN), which is proposed in this work.
For each RNN model, we compare three cell designs: LSTM, GRU, and ELSTM. We conduct experiments on three problems: part-of-speech (POS) tagging, language modeling (LM)¹ and dependency parsing (DP). We report the testing accuracy for the POS tagging problem, the perplexity (i.e., the natural exponential of the model's cross-entropy loss) for LM, and the unlabeled attachment score (UAS) and the labeled attachment score (LAS) for the DP problem. The POS tagging task is an easy one that requires shorter memory, while the LM and DP tasks demand longer memory. For the latter two tasks, there exist more complex relations between the input and the output. For the DP problem, we compare our solution with the GRU-based bi-attention model (bi-Att). Furthermore, we compare the DBRNN using the ELSTM cell with two other non-RNN-based neural network methods. One is the transition-based DP with neural network (TDP) proposed by Chen et al. [CC14]. The other is the convolutional seq2seq (ConvSeq2seq) proposed by Gehring et al. [GAMY+17]. For the proposed DBRNN, we show the result for the final combined output (namely, $p_t$). We adopt $T_s = 1$ in the basic RNN, BRNN, and DBRNN models and $T_s = 100$ in the other two seq2seq models for the POS tagging problem. We use $T_s = 3$ and $T_s = 100$ in all models for the LM problem and the DP problem, respectively.
The training, validation and testing datasets used for LM are from the Penn Treebank (PTB) [MSM93]. The PTB has 42,068, 3,370 and 3,761 training, validation and testing sentences, respectively. It has in total 10,000 tokens. The training dataset used for the POS tagging and DP problems is from the Universal Dependencies 2.0 English branch (UD-English). It contains 12,543 sentences and 14,985 unique tokens. The test dataset in both experiments is from the test English branch (gold, en.conllu) of the CoNLL 2017 shared task development and test data. The inputs to the POS tagging and the DP problems are the stemmed and lemmatized sequences (column 3 in CoNLL-U format). The target sequence for the POS tagging is the universal POS tag (column 4). The target sequence for the DP is the interleaved dependency relation to the headword (relation, column 8) and its headword position (column 7). As a result, the length of the target sequence for the DP is twice the length of the input sequence.

¹ It asks the machine to predict the next word given all preceding words in a sentence. It is also known as the automatic sentence generation task.
Table 3.3: Training dataset.
# Training # Validation # Testing # Tokens
PTB 42,068 3,370 3,761 10,000
UD 2.0 12,543 2,002 2,077 14,985
Table 3.4: Network parameters and training details.
Parameter POS & DP LM
Embedding layer vector size 512 5
Number of RNN cells 512 5
Batch size 20 50
Number of RNN layers 1
Training steps 11 epochs
Learning rate 0.5
Optimizer AdaGrad [Duc11]
The input is first fed into a trainable embedding layer [BDVJ03] before it is sent to the actual network. Table 3.4 shows the detailed network and training specifications. We do not fine-tune network hyper-parameters or apply any engineering tricks (e.g., feeding additional inputs other than the raw embedded input sequences) for the best possible performance, since our main goal is to compare the performance of the LSTM, GRU and ELSTM cells under various macro-models.
Table 3.5: LM test perplexity
LSTM GRU ELSTM
BASIC RNN 267.47 262.39 248.60
BRNN 78.56 82.83 71.65
Seq2seq 296.92 293.99 266.98
Seq2seq with Att 17.86 232.20 11.43
DBRNN 9.80 24.10 6.18
Table 3.6: DP test results (UAS/LAS %)
LSTM GRU ELSTM
BASIC RNN 43.24/25.28 45.24/29.92 58.49/36.10
BRNN 37.88/25.26 16.86/8.95 55.97/35.13
Seq2seq 29.38/6.05 36.47/13.44 48.58/24.05
Seq2seq with Att 31.82/16.16 43.63/33.98 64.30/52.60
DBRNN 51.38/39.71 52.23/37.25 61.35/43.32
Bi-Att [CFH+16]²   59.97/44.94
Table 3.7: POS tagging test accuracy (%)
LSTM GRU ELSTM
BASIC RNN 87.30 87.51 87.44
BRNN 89.55 89.39 89.29
Seq2seq 24.43 35.27 50.42
Seq2seq with Att 31.34 34.60 81.72
DBRNN 89.86 89.06 89.28
3.5.2 Comparison of RNN Models
The results of the LM, the DP and the POS tagging experiments are shown in Tables 3.5, 3.6 and 3.7, respectively.

² The result is generated by using exactly the same settings as in Table 3.4. We do not feed the network any information other than the input sequence itself.
Figure 3.5: The training perplexity vs. training steps of different cells.
Figure 3.6: The training perplexity vs. training steps of different macro models.
The training perplexities of the different cell models and the different macro models are shown in Figs. 3.5 and 3.6, respectively. We see that the proposed ELSTM cell outperforms the LSTM and GRU cells in most RNN models. This is especially true for complex language tasks like LM and DP, where the ELSTM cell outperforms the other cell designs by a significant margin. The ELSTM cell even outperforms the bi-Att model, which was designed specifically for the DP task. This demonstrates the effectiveness of the sequence of scaling factors adopted by the ELSTM cell: it allows the network to retain longer memory with better attention. For the simple POS tagging task, the ELSTM also shows equal or better performance in comparison with the other cell models. Overall, the ELSTM delivers good performance across tasks of different complexity.
The ELSTM cell with a large $T_s$ value performs particularly well for the seq2seq model (with and without attention). The hidden state $c_t$ of the ELSTM cell is more expressive in representing patterns over a longer distance. Since the seq2seq design relies on the expressive power of a hidden state, the ELSTM has a clear advantage.
Generally speaking, the number of scaling factors depends on the memory length required by the specific task. For example, the output of the POS tagging task is mostly a function of its immediately preceding, current and following inputs. Thus, the observation that using only one scaling factor gave the optimal performance for POS tagging makes sense, since the memory length required for this task is mostly 1. The same observation goes for the language modeling task, where the required memory length is widely believed to be 3-5 and the best performance was achieved with 3 scaling factors. On the other hand, the encoder-decoder RNN behaves differently: using more than the required number of scaling factors tends to yield better performance. This may have to do with the fact that the memory length required at the decoder side is not the same as that at the encoder. The underlying mechanism demands future study. Our recommendation is to estimate the required memory length before hyperparameter selection. If the estimation is not possible or reliable, one can begin by setting the number of scaling factors to the maximum training sequence length and searching for its optimal value by binary search.
For the DBRNN, we see that it achieves the best performance across the different macro models for the LM problem. It also outperforms the BRNN and the seq2seq in both the POS tagging and DP problems regardless of the cell type. This shows its robustness. The DBRNN may overfit to the training data in other cases. One may use a proper regularization scheme in the training process to address this, which will be an interesting future work item.
To substantiate our claim in Sec. 3.2, we conduct additional experiments to show the robustness of the ELSTM cell and the DBRNN. Specifically, we compare the performance of the same five models with the LSTM and ELSTM cells using $I_t = X_t$ for the same language tasks. We do not include the GRU cell since it inherently demands $I_t^{T} = [X_t^{T}, h_{t-1}^{T}]$. The convergence behaviors of $I_t = X_t$ and $I_t^{T} = [X_t^{T}, h_{t-1}^{T}]$ with the LSTM and ELSTM cells for the DP problem are shown in Fig. 3.7. We see that the ELSTM does not behave much differently between $I_t = X_t$ and $I_t^{T} = [X_t^{T}, h_{t-1}^{T}]$, while the LSTM does. This shows the effectiveness of the ELSTM design regardless of the input. More performance comparisons are provided in Appendix A.

Figure 3.7: The training perplexity vs. training steps of different models with $I_t = X_t$ and $I_t^{T} = [X_t^{T}, h_{t-1}^{T}]$ for the DP task.
3.5.3 Comparison between ELSTM and Non-RNN-based Methods
As stated earlier, the ELSTM design is more capable of extending the memory and capturing complex SISO relationships than other RNN cells. In this subsection, we compare the DP performance of two models built upon the ELSTM cell (namely, the DBRNN and the seq2seq with attention) and two non-RNN-based neural network methods (i.e., the TDP [CC14] and the convseq2seq [GAMY+17]). The TDP is a handcrafted method based on a parsing tree, and its neural network is a multi-layer perceptron with one hidden layer. Its neural network is used to predict the transition from a tail word to its headword. The convseq2seq is an end-to-end CNN with an attention mechanism. We used the default settings for the TDP and the convseq2seq as reported in [CC14] and [GAMY+17], respectively. For the TDP, we do not use the ground-truth POS tags but the predicted dependency relation labels as the input to the parsing tree for the next prediction.
We see from Table 3.8 that the ELSTM-based models learn much faster than the CNN-based convseq2seq model with fewer parameters. The convseq2seq uses dropout while the ELSTM-based models do not. It is also observed that the convseq2seq does not converge if AdaGrad is used as its optimizer.
Table 3.8: DP test accuracy (%) and system settings
Seq2seq-E DBRNN-E Convseq2seq TDP
UAS 64.30 61.35 52.55 62.29
LAS 52.60 43.32 44.19 52.18
Training steps 11 epochs 11 epochs 11 epochs 11 epochs
# parameters 12,684,468 16,460,468 22,547,124 950,555
Pretrained embedding No No No Yes
End-to-end Yes Yes Yes No
Regularization No No No Yes
Dropout No No Yes Yes
Optimizer AdaGrad AdaGrad NAG [Nes83] AdaGrad
Learning rate 0.5 0.5 0.25 0.01
Embedding size 512 512 512 50
Encoder layers 1 N/A 4 N/A
Decoder layers 1 N/A 4 N/A
Kernel size N/A N/A 3 N/A
Hidden layer size N/A N/A N/A 200
The ELSTM-based seq2seq with attention even outperforms the TDP, which was specifically designed for the DP task. Without a good pretrained word embedding scheme, the UAS and LAS of the TDP drop drastically to merely 8.93% and 0.30%, respectively.
3.6 Conclusion and Future Work
Although the memory of the LSTM and GRU cells fades more slowly than that of the SRN, it is still not long enough for complicated language tasks such as dependency parsing. To address this issue, we proposed the ELSTM to enhance the memory capability of an RNN cell. Besides, we presented a new DBRNN model that has the merits of both the BRNN and the encoder-decoder. It was shown by experimental results that the ELSTM outperforms other RNN cell designs by a significant margin for complex language tasks. The DBRNN model is superior to the BRNN and the seq2seq models for both simple and complex language tasks. Furthermore, the ELSTM-based RNN models outperform the CNN-based convseq2seq model and the handcrafted TDP. There are interesting issues to be explored further. For example, is the ELSTM cell also helpful in more sophisticated RNN models such as the deep RNN? Is it possible to make the DBRNN deeper and better? These are left for future study.
Chapter 4
Sequence Analysis via Dimension
Reduction Techniques
4.1 Motivation
As discussed in Sec. 1.1, DR is used to deal with the "curse of dimensionality" problem in NLP. The problem with the existing NN-based solutions is that they are limited in modeling "sequences of words", which is called the sequence-to-vector (seq2vec) problem, for two reasons. First, word embedding is trained on some particular dataset using the stochastic gradient descent method, which can easily lead to overfitting [LLXZ16]. Second, the vector space obtained by word embedding is still too large; it is desirable to convert a sequence of words into an even more compact form. The problem with non-NN-based solutions like PCA is that, although PCA has some nice properties such as maximum information preservation [Lin88] between its input and output under certain constraints, its computational complexity is exceptionally high as the dataset size becomes large. Furthermore, most non-RNN-based dimension reduction methods, such as [DDF+90, Uys16, CZLZ16], do not consider the positional correlation between elements in a sequence but adopt the "bag-of-words" (BoW) representation. The sequential information is lost in such a dimension reduction procedure.
To address the above-mentioned shortcomings, a novel technique, called the tree-
structured multi-stage PCA (TMPCA), was proposed in [SHK18]. The TMPCA method
has several interesting properties as summarized below.
1. High efficiency. Reduce the input data dimension with a small model size at low
computational complexity.
2. Low information loss. Maintain high mutual information between an input and
its dimension-reduced output.
3. Sequential preservation. Preserve the positional relationship between input ele-
ments.
4. Unsupervised learning. Do not demand labeled training data.
These properties are beneficial to classification tasks that demand low dimensionality yet a high amount of information in the transformed (or processed) data. They also relax the burden of data labeling in the training stage.
In this work, we present the TMPCA method and apply it to several text classifica-
tion problems such as spam email detection, sentiment analysis, news topic identifica-
tion, etc. The information preserving property of the TMPCA method is demonstrated
by examining the mutual information between its input and output. Also, we provide
extensive experimental results on large text classification datasets.
4.2 Tree-structured Multi-stage PCA (TMPCA)
In essence, TMPCA is a tree-structured multi-stage PCA method whose input at every stage consists of two adjacent, non-overlapping elements of the input sequence. The reason for using two elements rather than another number of elements is the computational efficiency of such an arrangement; this will be elaborated in Sec. 4.2.2. The block diagram of TMPCA with a single sequence $\{w_1,\ldots,w_N\}$ as its input is illustrated in Fig. 4.1. The input sequence length is $N$, where $N$ is assumed to be a power of 2 for ease of discussion below. We will relax this constraint for the practical implementation in Sec. 4.3. We use $z_j^{s}$ to denote the $j$th element in the output sequence of stage $s$ (or, equivalently, the input sequence of stage $s+1$ if such a stage exists). It is obvious that the final output $Y$ is also $z_1^{\log_2 N}$.
Figure 4.1: The Block diagram of the TMPCA method.
4.2.1 Training of TMPCA
To illustrate how TMPCA is trained, we use an example of a training dataset with two sequences, each of which has four numericalized elements. Each element is a column vector of size $D$, denoted as $w_j^{i}$, where $i$ indicates the corresponding sequence and $j$ is the position of the element in that sequence. At each stage of the TMPCA tree, every two adjacent, non-overlapping elements are concatenated to form one vector of dimension $2D$. It serves as a sample for PCA training at that stage. Thus, the training data matrix for PCA at the first stage can be written as
$$\begin{bmatrix} (w_1^{1})^{T} & (w_2^{1})^{T} \\ (w_3^{1})^{T} & (w_4^{1})^{T} \\ (w_1^{2})^{T} & (w_2^{2})^{T} \\ (w_3^{2})^{T} & (w_4^{2})^{T} \end{bmatrix}.$$
The trained PCA transform matrix at stage $s$ is denoted as $U^{s}$. It reduces the dimension of the input vector from $2D$ to $D$; that is, $U^{s}\in\mathbb{R}^{D\times 2D}$. The training matrix at the first stage is then transformed by $U^{1}$ to
$$\begin{bmatrix} (z_1^{1})^{T} \\ (z_2^{1})^{T} \\ (z_3^{1})^{T} \\ (z_4^{1})^{T} \end{bmatrix}, \quad
z_1^{1} = U^{1}\!\begin{bmatrix} w_1^{1} \\ w_2^{1} \end{bmatrix}, \;
z_2^{1} = U^{1}\!\begin{bmatrix} w_3^{1} \\ w_4^{1} \end{bmatrix}, \;
z_3^{1} = U^{1}\!\begin{bmatrix} w_1^{2} \\ w_2^{2} \end{bmatrix}, \;
z_4^{1} = U^{1}\!\begin{bmatrix} w_3^{2} \\ w_4^{2} \end{bmatrix}.$$
After that, we rearrange the elements of the transformed training matrix to form
$$\begin{bmatrix} (z_1^{1})^{T} & (z_2^{1})^{T} \\ (z_3^{1})^{T} & (z_4^{1})^{T} \end{bmatrix}.$$
It serves as the training matrix for the PCA at the second stage. We repeat the training data matrix formation, the PCA kernel determination and the PCA transform steps recursively at each stage until the length of the training samples becomes 1. It is apparent that, after one TMPCA stage, the sample length is halved while the element vector size remains $D$. The dimension evolution from the initial input data to the ultimate transformed data is shown in Table 4.1. Once the TMPCA is trained, we can use it to transform test data by following the same steps, except that we do not need to compute the PCA transform kernels at each stage.
Table 4.1: Dimension evolution from the input to the output in the TMPCA method.
Sequence length Element vector size
Input sequence N D
Output sequence 1 D
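A compact numpy sketch of the training and transform procedure described above is given below. It computes the stage-wise PCA kernels via an eigendecomposition of the covariance matrix, and assumes the sequence length is a power of 2 and the data are already numericalized (e.g., embedded) and approximately mean-removed; the data sizes are arbitrary.

```python
import numpy as np

def pca_kernel(samples, out_dim):
    """Return the top `out_dim` principal directions (rows) of `samples`."""
    x = samples - samples.mean(axis=0)
    cov = x.T @ x / max(len(x) - 1, 1)
    vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
    return vecs[:, ::-1][:, :out_dim].T         # (out_dim, in_dim)

def tmpca_train(data):
    """data: (num_sequences, N, D) with N a power of 2. Returns per-stage kernels."""
    kernels, z = [], data
    while z.shape[1] > 1:
        m, n, d = z.shape
        pairs = z.reshape(m * n // 2, 2 * d)    # concatenate adjacent element pairs
        U = pca_kernel(pairs, d)                # U^s in R^{D x 2D}
        kernels.append(U)
        z = (pairs @ U.T).reshape(m, n // 2, d) # halve the sequence length
    return kernels, z[:, 0, :]                  # kernels and final outputs Y

def tmpca_transform(data, kernels):
    z = data
    for U in kernels:
        m, n, d = z.shape
        z = (z.reshape(m * n // 2, 2 * d) @ U.T).reshape(m, n // 2, d)
    return z[:, 0, :]

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 8, 16))           # 100 sequences, N = 8, D = 16
kernels, Y_train = tmpca_train(train)
Y_test = tmpca_transform(rng.normal(size=(5, 8, 16)), kernels)
print(Y_train.shape, Y_test.shape)              # (100, 16) (5, 16)
```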
4.2.2 Computational Complexity
We analyze the time complexity of TMPCA training in this section. Consider a training dataset of $M$ samples, where each sample is of length $N$ with element vectors of dimension $D$. To determine the PCA model for this training matrix of dimension $\mathbb{R}^{M\times ND}$, it requires $O(MN^{2}D^{2})$ operations to compute the covariance matrix and $O(N^{3}D^{3})$ operations to compute the eigenvalues of the covariance matrix. Thus, the complexity of PCA can be written as
$$O(f_{PCA}) = O\big(N^{3}D^{3} + MN^{2}D^{2}\big). \qquad (4.1)$$
The above equation could be simplified by comparing the value of $M$ with $ND$. We do not pursue this direction further since it is problem dependent.
Suppose that we concatenate $P$ non-overlapping adjacent elements at each stage of TMPCA. The dimension of the training matrix at stage $s$ is $M\frac{N}{P^{s}}\times PD$. Then, the total computational complexity of TMPCA can be written as
$$O(f_{TMPCA}) = O\!\left(\sum_{s=1}^{\log_P N}\Big[(PD)^{3} + M\frac{N}{P^{s}}(PD)^{2}\Big]\right)
= O\!\left((P^{3}\log_P N)D^{3} + M\frac{P^{2}}{P-1}(N-1)D^{2}\right). \qquad (4.2)$$
The complexity of TMPCA is an increasing function of $P$. This can be verified by the non-negativity of its derivative with respect to $P$. Thus, the worst case is $P = N$, which is simply the traditional PCA applied to the entire sample in a single stage. When $P = 2$, TMPCA achieves its optimal efficiency. Its complexity is
$$O(f_{TMPCA}) = O\big(8(\log_2 N)D^{3} + 4M(N-1)D^{2}\big) = O\big(2(\log_2 N)D^{3} + M(N-1)D^{2}\big). \qquad (4.3)$$
By comparing Eqs. (4.3) and (4.1), we see that the time complexity of the traditional PCA grows at least quadratically with the sentence length $N$ (since $P = N$), while that of TMPCA grows at most linearly with $N$.
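The gap can be made concrete with the quick numeric illustration below, which evaluates the operation-count estimates of Eqs. (4.1) and (4.3) with big-O constants dropped; the dataset size and embedding dimension are arbitrary illustrative values.

```python
# Rough operation counts from Eqs. (4.1) and (4.3), constants dropped.
from math import log2

def pca_ops(M, N, D):
    return N**3 * D**3 + M * N**2 * D**2

def tmpca_ops(M, N, D):          # P = 2 case
    return 2 * log2(N) * D**3 + M * (N - 1) * D**2

M, D = 100_000, 50               # arbitrary dataset size and embedding dimension
for N in (8, 32, 128):
    print(N, pca_ops(M, N, D) / tmpca_ops(M, N, D))  # PCA-to-TMPCA ratio grows with N
```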
4.2.3 System Function
To analyze the properties of TMPCA, we derive its system function in closed form in this section. In particular, we will show that, similar to PCA, TMPCA is a linear transform and its transformation matrix has orthonormal rows. For the rest of this paper, we assume that the length of the input sequence is $N = 2^{L}$, where $L$ is the total number of TMPCA stages. The input is mean-removed so that its mean is $\mathbf{0}$.
We denote the elements of the input sequence by $w_j$, where $w_j\in\mathbb{R}^{D}$ and $j\in\{1,\ldots,N\}$. Then, the input sequence $X$ to TMPCA is a column vector of the form
$$X^{T} = [w_1^{T}, \cdots, w_N^{T}]. \qquad (4.4)$$
We decompose the PCA transform matrix $U^{s}$ at stage $s$ into two equal-sized block matrices as
$$U^{s} = [U_1^{s}, U_2^{s}], \qquad (4.5)$$
where $U_j^{s}\in\mathbb{R}^{D\times D}$ and $j\in\{1,2\}$. The output of TMPCA is $Y\in\mathbb{R}^{D}$.
With notations inherited from Sections 4.2.1 and 4.2.2, we can derive the closed-form expression of TMPCA by induction (see Appendix B). That is, we have
$$Y = UX, \qquad (4.6)$$
$$U = [U_1, \ldots, U_N], \qquad (4.7)$$
$$U_j = \prod_{s=1}^{L} U_{f_{j,s}}^{s}, \quad \forall j\in\{1,\ldots,N\}, \qquad (4.8)$$
$$f_{j,s} = b_L(j-1)_s + 1, \quad \forall j, s, \qquad (4.9)$$
where $b_L(x)_s$ is the $s$th digit of the $L$-bit binary form of $x$. TMPCA is a linear transform, as shown in Eq. (4.6). Also, since there always exist real-valued eigenvectors to form the PCA transform matrix, $U$, $U_j$ and $\{U_j^{s}\}_{j=1}^{2}$ are all real-valued matrices. The detailed derivation of TMPCA's system function is shown below:
For stage $s > 1$, we have
$$z_j^{s} = U^{s}\begin{bmatrix} z_{2j-1}^{s-1} \\ z_{2j}^{s-1}\end{bmatrix} = U_1^{s}\, z_{2j-1}^{s-1} + U_2^{s}\, z_{2j}^{s-1}, \qquad (4.10)$$
where $j = 1,\cdots,\frac{N}{2^{s}}$. When $s = 1$, we have
$$z_j^{1} = U_1^{1}\, w_{2j-1} + U_2^{1}\, w_{2j}. \qquad (4.11)$$
From Eqs. (4.10) and (4.11), we get
$$Y = z_1^{L} = \sum_{j=1}^{N}\left(\prod_{s=1}^{L} U_{f_{j,s}}^{s}\right) w_j, \qquad (4.12)$$
$$f_{j,s} = b_L(j-1)_s + 1, \qquad (4.13)$$
where $b_L(x)_s$ is the $s$th digit of the binarization of $x$ of length $L$. Eq. (4.12) can be further simplified to Eq. (4.6). For example, if $N = 8$, we obtain
$$\begin{aligned} Y = {} & U_1^{3}U_1^{2}U_1^{1}w_1 + U_1^{3}U_1^{2}U_2^{1}w_2 + U_1^{3}U_2^{2}U_1^{1}w_3 + U_1^{3}U_2^{2}U_2^{1}w_4 \\ & + U_2^{3}U_1^{2}U_1^{1}w_5 + U_2^{3}U_1^{2}U_2^{1}w_6 + U_2^{3}U_2^{2}U_1^{1}w_7 + U_2^{3}U_2^{2}U_2^{1}w_8. \end{aligned} \qquad (4.14)$$
The superscripts of $U_j^{s}$ are arranged in the stage order $L, L-1, \ldots, 1$. The subscripts are shown in Table 4.2. This is the reason that binarization is required to express the subscripts in Eqs. (4.6) and (4.12).
Table 4.2: Subscripts of $U_j^{s}$.
$w_1$    $w_2$    $w_3$    $w_4$    $w_5$    $w_6$    $w_7$    $w_8$
1,1,1    1,1,2    1,2,1    1,2,2    2,1,1    2,1,2    2,2,1    2,2,2
To show that $U$ has orthonormal rows, we first examine the properties of the matrix $K = [U_1, U_2]$. By setting
$$A = \prod_{s=2}^{L} U_{f_{1,s}}^{s} = \prod_{s=2}^{L} U_{f_{2,s}}^{s},$$
we obtain $K = [AU_1^{1}, AU_2^{1}]$. Since the matrix $[U_1^{1}, U_2^{1}]$ is a PCA transform matrix, it has orthonormal rows. Denoting $\langle\cdot\rangle_{ij}$ as the inner product between the $i$th row and the $j$th row of a matrix, we conclude that $\langle K\rangle_{ij} = \langle A\rangle_{ij}$ using the following property.

Lemma 4.2.1. Given $K = [AB_1, AB_2]$, where $[B_1, B_2]$ has orthonormal rows, then $\langle K\rangle_{ij} = \langle A\rangle_{ij}$.

We then let
$$K_m^{s} = [A_m^{s}U_1^{s},\; A_m^{s}U_2^{s}], \qquad (4.15)$$
where $s\in\{1,\ldots,L\}$ indicates the stage, $m\in\{1,\ldots,\frac{N}{2^{s}}\}$, and
$$A_m^{s} = \prod_{k=s+1}^{L} U_{f_{m,k-s}}^{k}, \quad\text{and}\quad A_1^{L} = I, \qquad (4.16)$$
where $I$ is the identity matrix. At stage 1, $K_m^{1} = [U_{2m-1}, U_{2m}]$, so $U = [K_1^{1},\ldots,K_{N/2}^{1}]$. Since $\langle U\rangle_{ij} = \sum_{m=1}^{N/2}\langle K_m^{1}\rangle_{ij}$, according to Lemma 4.2.1 and Eqs. (4.15)-(4.16) we have
$$\langle U\rangle_{ij} = \sum_{m=1}^{N/2}\langle A_m^{1}\rangle_{ij} = \sum_{m=1}^{N/4}\langle K_m^{2}\rangle_{ij} = \sum_{m=1}^{N/4}\langle A_m^{2}\rangle_{ij} = \cdots = \langle K_1^{L}\rangle_{ij} = \langle [U_1^{L}, U_2^{L}]\rangle_{ij}. \qquad (4.17)$$
Thus, $U$ has orthonormal rows.
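The orthonormal-row property can be checked numerically with the small sketch below for $N = 4$ ($L = 2$); random matrices with orthonormal rows stand in for trained PCA kernels, so this is only an illustration of the algebra above, not of trained TMPCA kernels.

```python
import numpy as np

# Numerical check that U built per Eqs. (4.6)-(4.9) has orthonormal rows, N = 4 (L = 2).
rng = np.random.default_rng(0)
D = 5

def random_pca_like(D):
    """Random matrix with D orthonormal rows of length 2D, split into two D x D blocks."""
    q, _ = np.linalg.qr(rng.normal(size=(2 * D, 2 * D)))
    U = q[:D, :]
    return U[:, :D], U[:, D:]          # blocks U^s_1, U^s_2

(U11, U12), (U21, U22) = random_pca_like(D), random_pca_like(D)  # stages 1 and 2
U = np.hstack([U21 @ U11, U21 @ U12, U22 @ U11, U22 @ U12])      # U = [U_1, ..., U_4]
print(np.allclose(U @ U.T, np.eye(D)))  # True: rows of U are orthonormal
```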
4.2.4 Information Preservation Property
Besides its low computational complexity, linearity and orthonormality, TMPCA can preserve the information of its input effectively so as to facilitate the subsequent classification process. To show this point, we investigate the mutual information [Lin88, BHS15] between the input and the output of TMPCA.
Here, the input to the TMPCA system is modeled as
$$X = G + n, \qquad (4.18)$$
where $G$ and $n$ model the ground-truth semantic signal and the noise component in the input, respectively. In other words, $G$ carries the essential information for the text classification task while $n$ is irrelevant to (or weakly correlated with) the task. We are interested in finding the mutual information between the output $Y$ and the ground truth $G$.
By following the framework in [Lin88], we make the following assumptions:
1. $Y \sim \mathcal{N}(\bar{y}, V)$;
2. $n \sim \mathcal{N}(\mathbf{0}, B)$, where $B = \sigma^{2} I$;
3. $n$ is uncorrelated with $G$.
In the above, $\mathcal{N}$ denotes the multivariate Gaussian density function. Then, the mutual information between $Y$ and $G$ can be computed as
$$I(Y,G) = E_{Y,G}\!\left[\ln\frac{P(Y|G)}{P(Y)}\right] = E_{Y,G}\!\left[\ln\frac{\mathcal{N}(Ug,\, UBU^{T})}{\mathcal{N}(\bar{y},\, V)}\right]
= \frac{1}{2}\ln\frac{|V|}{|UBU^{T}|} - \frac{1}{2}E_{Y,G}\!\left[(y-Ug)^{T}(UBU^{T})^{-1}(y-Ug)\right] + \frac{1}{2}E_{Y,G}\!\left[(y-\bar{y})^{T}V^{-1}(y-\bar{y})\right], \qquad (4.19)$$
where $y\in Y$, $g\in G$, and $P(\cdot)$, $|\cdot|$ and $E_X$ denote the probability density function, the determinant and the expectation with respect to random variable $X$, respectively. It is straightforward to prove the following lemma.

Lemma 4.2.2. For any random vector $X\in\mathbb{R}^{D}$ with covariance matrix $K_x$, the following equality holds:
$$E_X\{(x-\bar{x})^{T}(K_x)^{-1}(x-\bar{x})\} = D.$$

Then, based on this lemma, we can derive that
$$I(Y,G) = \frac{1}{2}\ln\frac{|V|}{\sigma^{2D}}. \qquad (4.20)$$
The above equation can be interpreted as follows. Given the input signal noise, the mutual information is maximized by maximizing the determinant of the output covariance matrix. Since TMPCA maximizes the covariance of its output at each stage, TMPCA delivers an output with the largest mutual information at the corresponding stage. We will show experimentally in Section 4.3 that the mutual information of TMPCA is significantly larger than that of the mean operation and close to that of PCA.
4.3 Experimental Results
4.3.1 Datasets
We tested the performance of the TMPCA method on twelve datasets covering various text classification tasks, as shown in Table 4.3. Four of them are smaller datasets with at most 10,000 training samples. The other eight are large-scale datasets [ZJL15] with training samples ranging from 120 thousand to 3.6 million.
These datasets are briefly introduced below.
1. SMS Spam (spam) [AHY11]. It is a dataset collected for mobile Spam email
detection. It has two target classes: “Spam” and “Ham”.
2. Stanford Sentiment Treebank (sst) [SPW+13]. It is a dataset for sentiment analysis. The labels are generated using the Stanford CoreNLP toolkit [Staa]. The sentences labeled as very negative or negative are grouped into one negative class. Sentences labeled as very positive or positive are grouped into one positive class. We keep only positive and negative sentences for training and testing.
Table 4.3: Selected text classification datasets.
# of Class Train Samples Test Samples # of Tokens
spam 2 5,574 558 14,657
sst 2 8409 1803 18,519
semeval 2 5098 2034 25,167
imdb 2 10162 500 20,892
agnews 4 120,000 7,600 188,111
sogou 5 450,000 60,000 800,057
dbpedia 14 560,000 70,000 1,215,996
yelpp 2 560,000 38,000 1,446,643
yelpf 5 650,000 50,000 1,622,077
yahoo 10 1,400,000 60,000 4,702,763
amzp 2 3,600,000 400,000 4,955,322
amzf 5 3,000,000 650,000 4,379,154
3. Semantic Evaluation 2013 (semeval) [WKN+13]. It is a dataset for sentiment analysis. We focus on sentiment task A with two target classes, positive and negative.
Sentences labeled as “neutral” are removed.
4. Cornell Movie review (imdb) [BL05]. It is a dataset for sentiment analysis for
movie reviews. It contains a collection of movie review documents with their
sentiment polarity (i.e., positive or negative).
5. AG’s news (agnews) [ZZ05]. It is a dataset for news categorization. Each sample
contains the news title and description. We combine the title and description into
one sentence by inserting a colon in between.
6. Sougou news (sogou) [ZZ05]. It is a Chinese news categorization dataset. Its
corpus uses a phonetic romanization of Chinese.
7. DBPedia (dbpedia) [ZZ05]. It is an ontology categorization dataset with its sam-
ples extracted from the Wikipedia. Each training sample is a combination of its
title and abstract.
8. Yelp reviews (yelpp and yelpf) [ZZ05]. They are sentiment analysis datasets.
The Yelp review full (yelpf) has target classes ranging from one to five stars. The
one star is the worst while five stars the best. The Yelp review polarity (yelpp) has
positive/negative polarity labels by treating stars 1 and 2 as negative, stars 4 and 5
positive and omitting star 3 in the polarity dataset.
9. Yahoo! answers (yahoo) [ZZ05]. It is a topic classification dataset for Yahoo’s
question and answering corpus.
10. Amazon reviews (amzp and amzf) [ZZ05]. These two datasets are similar to
Yelp reviews but of much larger sizes. They are about Amazon product reviews.
4.3.2 Experimental Setup
We compare the performance of the following three methods on the four small datasets:
1. TMPCA-preprocessed data followed by the dense network (TMPCA+Dense);
2. fastText;
3. PCA-preprocessed data followed by the dense network (PCA+Dense).
For the eight larger datasets, we compare the performance of six methods. They are:
1. TMPCA-preprocessed data followed by the dense network (TMPCA+Dense);
2. fastText;
3. PCA-preprocessed data followed by the dense network (PCA+Dense);
4. char-CNN [ZJL15];
5. LSTM (an RNN based method) [ZJL15];
6. BoW [ZJL15].
Besides training time and classification accuracy, we compute the mutual informa-
tion between the input and the output of the TMPCA method, the mean operation (used
by fastText for hidden vector computation) and the PCA method, respectively. Note that
the mean operation can be expressed as a linear transform of the form
Y = (1/N) [I, ..., I] X,    (4.21)
where I ∈ R^{D×D} is the identity matrix and the mean transform matrix contains N copies of I. The mutual information between the input and the output of the mean operation can be calculated as
I(Y;G) = (1/2) ln( |V| N^D / σ^{2D} ).    (4.22)
For a fixed noise variance σ², we can compare the mutual information between the input and the output of different operations by comparing their associated |V| and |V| N^D, respectively.
To illustrate the information preservation property of TMPCA across multiple
stages, we compute the output energy, which is the sum of squared elements in a vec-
tor/tensor, as a percentage of its input energy, and see how the energy values decrease
as the stage number becomes bigger. Such investigation is meaningful since the energy
indicates signal’s variance in a TMPCA system. The variance is a good indicator of
information richness. The energy percentage is an indicator of the amount of input
information that is preserved after one TMPCA stage. We compute the total energy of
multiple sentences by adding them together.
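A rough NumPy sketch of this energy bookkeeping is given below (ours; the per-stage transforms are plain pairwise PCAs fitted on toy Gaussian data, an assumption standing in for the trained TMPCA stage matrices, so the printed percentages only illustrate the computation, not the numbers in Fig. 4.3):

import numpy as np

def tmpca_stage(Z):
    # One TMPCA-style stage: pair consecutive D-dim vectors, then PCA the
    # 2D-dim pairs back down to D dimensions (one shared transform for all pairs).
    S, n, D = Z.shape                          # (sentences, length, dim)
    pairs = Z.reshape(S * n // 2, 2 * D)       # concatenate neighboring vectors
    pairs = pairs - pairs.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(pairs, rowvar=False))
    U = eigvecs[:, -D:].T                      # top-D principal directions
    out = pairs @ U.T
    return out.reshape(S, n // 2, D)

rng = np.random.default_rng(0)
D, N = 10, 32                                  # assumed embedding dim and input length
Z = rng.normal(size=(2000, N, D))              # toy stand-in for embedded sentences
input_energy = np.sum(Z ** 2)

for stage in range(1, int(np.log2(N)) + 1):
    Z = tmpca_stage(Z)
    pct = 100.0 * np.sum(Z ** 2) / input_energy
    print(f"stage {stage}: energy retained = {pct:.1f}%")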
To numericalize the input data, we first remove the stop words from sentences
according to the stop-word list, tokenize sentences and, then, stem tokens using the
python natural language toolkit (NLTK). Afterwards, we use the fastText-trained embed-
ding layer to embed the tokens into vectors of size 10. The tokens are then concatenated
to form a single long vector.
In TMPCA, to ensure that the input sequence is of the same length and equal to a power of 2, we assign a fixed input length, N = 2^L, to all sentences of length N_1. If N_1 < N, we preprocess the input sequence by padding it to length N with a special symbol. If N_1 > N, we shorten the input sequence by dividing it into N segments and calculating the mean of the numericalized elements in each segment. The new sequence is then formed by the calculated means. The reason for dividing a sequence into segments is to keep consecutive elements as close as possible. The segmentation of an input sequence is conducted as follows (a code sketch is given after the example below).
1. Calculate the least number of elements that each segment should have: d = floor(N_1 / N), where floor denotes the flooring operation.
2. Allocate the remaining r = N_1 - dN elements by adding one more element to every other floor(N/r)-th segment until there are no more elements left.
To give an example, to partition the sequence {w_1, ..., w_10} into four segments, we have 3, 2, 3, 2 elements in the four segments, respectively. That is, they are: {w_1, w_2, w_3}, {w_4, w_5}, {w_6, w_7, w_8}, {w_9, w_10}.
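A minimal sketch of the pad-or-segment preprocessing described above; the function name and the padding symbol are ours:

def to_fixed_length(tokens, N, pad="<pad>"):
    # Pad a short sequence to length N (one token per segment), or shorten a long
    # one by grouping it into N segments whose elements would then be averaged.
    n1 = len(tokens)
    if n1 <= N:
        return [[t] for t in tokens] + [[pad]] * (N - n1)
    d = n1 // N                      # step 1: least number of elements per segment
    sizes = [d] * N
    r = n1 - d * N                   # step 2: distribute the remaining r elements
    i = 0
    for _ in range(r):
        sizes[i] += 1                # one extra element to every other (N // r)-th segment
        i += N // r
    segments, start = [], 0
    for s in sizes:
        segments.append(tokens[start:start + s])
        start += s
    return segments

print(to_fixed_length([f"w{i}" for i in range(1, 11)], 4))
# [['w1', 'w2', 'w3'], ['w4', 'w5'], ['w6', 'w7', 'w8'], ['w9', 'w10']]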
For large-scale datasets, we calculate the training data covariance matrix for TMPCA
incrementally by calculating the covariance matrix on each smaller non-overlapping
chunk of the data and, then, adding the calculated matrices together. The parameters
used in dense network training are shown in Table 4.4. For TMPCA and PCA, the
numericalized input data are first preprocessed to a fixed length and, then, have their
means removed. TMPCA, fastText and PCA were trained on Intel Core i7-5930K CPU.
The dense network was trained on the GeForce GTX TITAN X GPU. TMPCA and
PCA were not optimized for multi-threading whereas fastText was run on 12 threads in
parallel.
Table 4.4: Parameters in dense network training.
Input size 10
Output size # of target class
Training steps 5 epochs
Learning rate 0.5
Training optimizer Adam [KB15]
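The chunk-wise covariance accumulation mentioned above can be sketched as follows (ours; it assumes the data are already mean-removed, as stated earlier, and it sums per-chunk Gram matrices with a single normalization at the end, a variant of adding per-chunk covariance matrices):

import numpy as np

def incremental_covariance(chunks):
    # Accumulate X^T X over non-overlapping chunks of mean-removed data,
    # then normalize once; equivalent to the covariance of the full dataset.
    total, count = None, 0
    for X in chunks:                       # X: (chunk_size, feature_dim)
        gram = X.T @ X
        total = gram if total is None else total + gram
        count += X.shape[0]
    return total / count

rng = np.random.default_rng(0)
data = rng.normal(size=(10000, 20))
data -= data.mean(axis=0)                  # means removed, as in the setup above
chunked = incremental_covariance(np.array_split(data, 10))
full = data.T @ data / data.shape[0]
print(np.allclose(chunked, full))          # True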
4.3.3 Results
Performance Benchmarking with State-of-the-Art Methods
We report the results of using the TMPCA method for feature extraction and the dense
network for decision making in terms of test accuracy, time complexity and model com-
plexity for text classification with respect to the eight large datasets. Furthermore, we
conduct performance benchmarking of the proposed TMPCA model against several state-of-the-art models.
The bigram training data for the dense network are generated by concatenating the bigram representation of each sample to its original token sequence. For example, the sample {w_1, w_2, w_3} becomes {w_1, w_2, w_3, w_1w_2, w_2w_3} after the bigram process (see the sketch below). The models other than TMPCA are from their original reports in [ZJL15] and [JGBM17]. There
are two char-CNN models. We report the test accuracy of the better model in Table 4.5
and the time and model complexity of the smaller model in Tables 4.6 and 4.7. The time
reported for char-CNN and fastText in Table 4.6 is for one epoch only.
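A small sketch of the bigram augmentation described above (the function name is ours):

def add_bigrams(tokens):
    # Append adjacent-token bigrams to the original unigram sequence,
    # e.g. [w1, w2, w3] -> [w1, w2, w3, w1w2, w2w3].
    return tokens + [a + b for a, b in zip(tokens, tokens[1:])]

print(add_bigrams(["w1", "w2", "w3"]))  # ['w1', 'w2', 'w3', 'w1w2', 'w2w3']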
It is obvious that the TMPCA+Dense method is much faster. Besides, it achieves
better or commensurate performance as compared with other state-of-the-art methods.
In addition, the number of parameters of TMPCA is also much less than other models
as shown in Table 4.7.
Table 4.5: Testing accuracy (%) of different TC models.
            BoW    LSTM   char-CNN   fastText   TMPCA+Dense (bigram, N=8)
agnews 88.8 86.1 87.2 91.5 92.1
sogou 92.9 95.2 95.1 93.9 97.0
dbpedia 96.6 98.6 98.3 98.1 98.6
yelpp 92.2 94.7 94.7 93.8 95.1
yelpf 58.0 58.2 62.0 60.4 64.1
yahoo 68.9 70.8 71.2 72.0 72.0
amzp 90.4 93.9 94.5 91.2 94.2
amzf 54.6 59.4 59.5 55.8 59.0
Table 4.6: Comparison of training time for different models.
            small char-CNN (per epoch)   fastText (per epoch)   TMPCA+Dense (bigram, N=8)
agnews 1h 1s 0.025s
sogou - 7s 0.081s
dbpedia 2h 2s 0.101s
yelpp - 3s 0.106s
yelpf - 4s 0.116s
yahoo 8h 5s 0.229s
amzp 2d 10s 0.633s
amzf 2d 9s 0.481s
Comparison between TMPCA and PCA
We compare the performance between TMPCA+Dense and PCA+Dense to shed light
on the property of TMPCA. Their input are unigram data in each original dataset. We
compare their training time in Table 4.8. It clearly shows the advantage of TMPCA in
terms of computational efficiency. TMPCA takes less than one second for training in
most datasets. As the input sequence becomes longer, the training time of TMPCA grows linearly. In contrast, it grows much faster in the PCA case.
Table 4.7: Comparison of the number of model parameters in different models.
            small char-CNN   fastText   TMPCA+Dense (bigram, N=8)
agnews      2.7e+06          1.9e+06    600
sogou       2.7e+06          8e+06      600
dbpedia     2.7e+06          1.2e+07    600
yelpp       2.7e+06          1.4e+07    600
yelpf       2.7e+06          1.6e+07    600
yahoo       2.7e+06          4.7e+07    600
amzp        2.7e+06          5e+07      600
amzf        2.7e+06          4.4e+07    600
To show the information preservation property of TMPCA, we include fastText in
the comparison. Since the difference between these three models is the way to compute
the hidden vector, we compare TMPCA, mean operation (used by fastText), and PCA.
We show the accuracy for input sequences of length 2, 4, 8, 16 and 32 in Fig. 4.2. They
correspond to the 1-, 2-, 3-, 4- and 5-stage TMPCA, respectively. We show two relative
mutual information values in Table 4.9 and Table 4.10. Table 4.9 provides the mutual
information ratio between TMPCA and mean. Table 4.10 offers the mutual informa-
tion ratio between PCA and TMPCA. We see that TMPCA is much more capable than
mean and is comparable with PCA in preserving the mutual information. Although
higher mutual information does not always translate into better classification perfor-
mance, there is a strong correlation between them. This substantiates our mutual infor-
mation discussion. We should point out that the mutual information on different inputs
(in our case, different N values) is not directly comparable. Thus, a higher relative
mutual information value on longer inputs cannot be interpreted as containing richer
information and, consequently, higher accuracy. We observe that the dense network
achieves its best performance when N = 4 or 8.
To understand the information loss at each TMPCA stage, we plot the energy percentages in Fig. 4.3, where the input has a length of N = 32. For TMPCA, the energy drops as the number of stages increases, and the sharp drop usually happens after 2 or 3 stages. This observation is confirmed by the results in Fig. 4.2. For performance benchmarking, we provide the energy percentage of PCA in the same figure. Since PCA has only one stage, we use a horizontal line to represent its percentage level. Its value is equal to or slightly higher than the energy percentage at the final stage of TMPCA. This is corroborated by the closeness of their mutual information values in Table 4.10. The information-preserving and low-computational-complexity properties make TMPCA an excellent dimension reduction pre-processing tool for text classification.
Table 4.8: Comparison of training time in seconds (TMPCA/PCA).
            N=4            N=8            N=16           N=32
spam 0.007/0.023 0.006/0.090 0.007/0.525 0.011/7.389
sst 0.007/0.023 0.006/0.090 0.008/0.900 0.009/5.751
semeval 0.005/0.017 0.007/0.111 0.021/2.564 0.009/5.751
imdb 0.006/0.019 0.008/0.114 0.009/0.781 0.009/6.562
agnews 0.014/0.053 0.017/0.325 0.033/4.100 0.061/47.538
sogou 0.029/0.111 0.053/1.093 0.134/17.028 0.214/173.687
dbpedia 0.039/0.145 0.092/1.886 0.125/15.505 0.348/279.405
yelpp 0.037/0.145 0.072/1.517 0.163/20.740 0.272/222.011
yelpf 0.035/0.137 0.072/1.517 0.157/19.849 0.328/268.698
yahoo 0.068/0.269 0.129/2.714 0.322/40.845 0.787/642.278
amzp 0.184/0.723 0.379/8.009 0.880/112.021 1.842/1504.912
amzf 0.167/0.665 0.351/7.469 0.778/99.337 1.513/1237.017
4.4 Conclusion
An efficient language data dimension reduction technique, called the TMPCA method,
was proposed for TC problems in this work. TMPCA is a multi-stage PCA in special
form, and it can be described by a transform matrix with orthonormal rows. It can retain
Figure 4.2: Comparison of testing accuracy (%) of fastText (dotted blue), TMPCA+Dense (red solid), and PCA+Dense (green head dotted), where the horizontal axis is the input length N; one panel per dataset (spam, sst, semeval, imdb, agnews, sogou, dbpedia, yelpp, yelpf, yahoo, amzp, amzf).
Figure 4.3: The energy of the TMPCA (red solid) and PCA (green head dotted) coefficients expressed as percentages of the energy of input sequences of length N = 32, where the horizontal axis indicates the TMPCA stage number while PCA has only one stage; one panel per dataset (spam, sst, semeval, imdb, agnews, sogou, dbpedia, yelpp, yelpf, yahoo, amzp, amzf).
Table 4.9: The relative mutual information ratio (TMPCA versus Mean).
            N=2         N=4         N=8         N=16        N=32
spam 1.32e+02 7.48e+05 2.60e+12 5.05e+14 9.93e+12
sst 8.48e+03 1.22e+10 1.28e+15 8.89e+15 9.17e+13
semeval 5.52e+03 1.13e+09 3.30e+14 4.78e+15 1.67e+13
imdb 1.34e+04 3.49e+09 1.89e+14 8.73e+14 1.05e+13
agnews 4.10e+05 5.30e+10 7.09e+11 3.56e+12 6.11e+12
sogou 5.53e+08 1.37e+13 6.74e+13 5.40e+13 4.21e+13
dbpedia 20.2 111 227 814 306
yelpp 8.42e+04 2.79e+11 3.85e+15 5.65e+16 1.46e+16
yelpf 2.29e+07 1.90e+11 5.92e+12 5.42e+12 1.58e+12
yahoo 6.7 9.1 9.9 5.8 1.5
amzp 7.34e+05 4.48e+11 1.24e+16 1.15e+18 2.75e+18
amzf 3.09e+06 1.47e+10 3.38e+11 1.48e+12 2.37e+12
Table 4.10: The relative mutual information ratio (PCA versus TMPCA).
            N=4     N=8     N=16    N=32
spam 1.04 1.00 1.00 1.49
sst 1.00 1.00 1.00 1.36
semeval 0.99 1.00 1.00 1.09
imdb 1.02 1.00 1.00 1.29
agnews 1.00 1.01 1.40 2.92
sogou 1.00 1.20 1.66 5.17
dbpedia 1.16 1.63 1.65 1.75
yelpp 1.00 1.00 1.00 1.13
yelpf 1.00 1.01 1.01 1.10
yahoo 1.01 1.30 1.94 8.78
amzp 1.00 1.00 1.00 1.10
amzf 1.00 1.00 1.03 1.41
the input information by maximizing the mutual information between its input and out-
put, which is beneficial to TC problems. It was shown by experimental results that a
dense network trained on the TMPCA-preprocessed data outperforms the state-of-the-art fastText, char-CNN and LSTM models on quite a few TC datasets. Furthermore, the number of parameters used by TMPCA is an order of magnitude smaller than that of other NN-based models. Typically, TMPCA takes less than one second of training time on a large-scale dataset
that has millions of samples. To conclude, TMPCA is a powerful dimension reduc-
tion pre-processing tool for text classification for its low computational complexity, low
storage requirement for model parameters and high information preserving capability.
Chapter 5
Image-assisted Neural Machine
Translation
5.1 Motivation
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results [LOC+18] with only large monolingual corpora in each language. However, the uncertainty of associating target with source sentences makes UNMT theoretically an ill-posed problem. This work investigates the possibility of utilizing images for disambiguation to improve the performance of UNMT. Our assumption is intuitively based on the invariance property of images, i.e., descriptions of the same visual content in different languages should be approximately similar. We propose an unsupervised multi-modal machine translation (UMNMT) framework based on the language translation cycle-consistency loss conditioned on the image, aiming to learn the bidirectional multi-modal translation simultaneously. Through alternating training between multi-modal and uni-modal data, our inference model can translate with or without the image. On the widely used Multi30K dataset, the experimental results of our approach are significantly better than those of the text-only UNMT on the 2016 test dataset.
Our long-term goal is to build intelligent systems that can perceive their visual envi-
ronment and understand the linguistic information, and further make an accurate trans-
lation inference to another language. Since images have become an important source for humans to learn and acquire knowledge (e.g., video lectures [ABA+16, KAS14, ZEL+18]), the visual signal might be able to disambiguate certain semantics. One way to make image content easier and faster to understand is to combine it with a narrative description that is self-explanatory. This is particularly important for many natural language processing (NLP) tasks as well, such as image captioning [VTBE15] and task-specific translation, e.g., sign language translation [CHK+18].
Figure 5.1: Illustration of our proposed approach. We leverage the designed loss function to tackle a supervised task with the unsupervised dataset only. SCE means sequential cross-entropy.
Our idea is originally inspired by the text-only unsupervised MT (UMT) work [CLR+18, LCDR18, LOC+18], which investigates whether it is possible to train a general MT system without any form of supervision. As [LOC+18] discussed, the text-only UMT is fundamentally an ill-posed problem, since there are potentially many ways to associate target
with source sentences. Intuitively, since the visual content and language are closely
related, the image can play the role of a pivot “language” to bridge the two languages
without a parallel corpus, making the problem “more well-defined” by reducing the
problem to supervised learning. However, unlike the text translation involving word
generation (usually a discrete distribution), the task to generate a dense image from a
sentence description itself is a challenging problem [MPBS16]. High quality image
generation usually depends on a complicated or large scale neural network architec-
ture [RAY
16, XZH
18]. Thus, it is not recommended to utilize the image dataset as a
pivot “language” [YLL18]. Motivated by the cycle-consistency [ZPIE17], we tackle the
unsupervised translation with a multi-modal framework which includes two sequence-
to-sequence encoder-decoder models and one shared image feature extractor. We don’t
introduce the adversarial learning via a discriminator because of the non-differentiable
arg max operation during word generation. With five modules in our framework, there
are multiple data streaming paths in the computation graph, inducing the auto-encoding
loss and cycle-consistency loss, in order to achieve the unsupervised translation.
Another challenge of unsupervised multi-modal translation, and more broadly for
general multi-modal translation tasks, is the need to develop a reasonable multi-source
encoder-decoder model that is capable of handling multi-modal documents. Moreover,
during training and inference stages, it is better to process the mixed data format includ-
ing both uni-modal and multi-modal corpora.
First, this challenge highly depends on the attention mechanism across different
domains. Recurrent Neural Networks (RNN) and Convolutional Neural Networks
(CNN) are naturally suitable to encode the language text and visual image respectively;
however, the encoded features of an RNN have an autoregressive property, which is different from the local dependency of a CNN. The multi-head self-attention transformer [VSP+17] can
mimic the convolution operation, and allow each head to use different linear transfor-
mations, where in turn different heads can learn different relationships. Unlike RNN,
it reduces the length of the paths of states from the higher layer to all states in the
lower layer to one, and thus facilitates more effective learning. For example, the BERT
model [DCLT18], that is completely built upon self-attention, has achieved remarkable
performance in 11 natural language tasks. Therefore, we employ transformer in both the
text encoder and decoder of our model, and design a novel joint attention mechanism
to simulate the relationships among the three domains. Besides, the mixed data format
requires the desired attention to support the flexible data stream. In other words, the
batch fetched at each iteration can be either uni-modal text data or multi-modal text-
image paired data, allowing the model to be adaptive to various data during inference as
well.
5.2 Unsupervised Multi-modal Neural Machine Trans-
lation
5.2.1 Methodology
In this section we first briefly describe the main MT systems that our method is built
upon and then elaborate on our approach.
Neural Machine Translation
If a bilingual corpus is available, given a source sentence x = (x_1, ..., x_n) of n tokens and a translated target sentence y = (y_1, ..., y_m) of m tokens, where (x, y) ∈ X × Y, the NMT model aims at maximizing the likelihood
p(y|x) = Π_{t=1}^{m} p(y_t | y_{<t}, x).    (5.1)
The attention-based sequence-to-sequence encoder-decoder architecture [BCB14, WSC+16, GAMY+17, VSP+17] is usually employed to parameterize the above conditional probability.
The encoder reads the source sentence and outputs a hidden representation vector for each token, {h^e_1, ..., h^e_n} = Enc_x(x). The attention-based decoder is defined in a recurrent way. Given the summarized decoder representation vector h^d_t = Dec_y(y_{<t}, x) at time step t, the model produces a context vector c_t = Σ_{j=1}^{n} α_j h^e_j based on an alignment model, {α_1, ..., α_n} = Align(h^d_t, {h^e_1, ..., h^e_n}), such that Σ_{j=1}^{n} α_j = 1. Therefore, the conditional probability to predict the next token can be written as
p(y_t | y_{<t}, x) = softmax( g(c_t, y_{t-1}, h^d_{t-1}) ),    (5.2)
in which g(·) denotes a non-linear function extracting features to predict the target. The encoder and decoder model described here is in a general formulation, not constrained to be an RNN [BCB14] or a transformer architecture [VSP+17].
Multi-modal Neural Machine Translation
In this task, an image z and the descriptions of the image in two different languages form a triplet (x, y, z) ∈ X × Y × I. Thus, the problem naturally becomes maximizing the new likelihood p(y|x, z). Though the overall framework of such a translation task is still the encoder-decoder architecture, the detailed feature extractor and attention module can vary greatly due to the extra source image.
Figure 5.2: Model overview. Left panel: the detailed unsupervised multi-modal neural machine translation model includes five modules: two transformer encoders, two transformer decoders and one ResNet encoder. Some detailed network structures within the transformer, like skip connections and layer normalization, are omitted for clarity. Right panel: the entire framework consists of four training paths; the gray arrows in the paths for the cycle-consistency loss indicate that the model is in inference mode, e.g., the time-step decoding for the token “hat” is illustrated.
The traditional approach [SFSE16, EFB+17a] is to encode the source text and the image separately and combine them at the high-level features, where the image feature map can be represented as {h^i_1, ..., h^i_k} = Enc_z(z) and Enc_z is usually a truncated image classification model, such as ResNet [HZRS16b]. Notice that, unlike the number of text features, which is exactly the number of tokens in the source, the number of image features k depends on the last layer of the truncated network. Then, the context vector is computed via an attention model,
c_t = Attention( h^d_t, {h^e_1, ..., h^e_n}, {h^i_1, ..., h^i_k} ).    (5.3)
Since three sets of features appear in Eq. (5.3), there are more options for the attention mechanism than in text-only NMT. The decoder can remain the same, in the recurrent fashion.
Unsupervised Learning
The unsupervised problem requires a new problem definition. On both the source and the target sides, only monolingual documents are presented in the training data, i.e., the data come in the paired forms (x, z) ∈ X × I and (y, z) ∈ Y × I. The triplet data format is no longer available. The purpose is to learn a multi-modal translation model X × I → Y or a text-only one X → Y. Note that there is no explicit pairing information across the two languages, making it impossible to straightforwardly optimize the supervised likelihood. Fortunately, motivated by CycleGAN [ZPIE17] and the dual learning in [HXQ+16], we can actually learn the translation models for both directions between the source and the target in an unsupervised way. Additionally, we can even make the multi-modal and uni-modal inference compatible with a deliberate fine-tuning strategy.
Auto-Encoding Loss
As Figure 5.2 illustrates, there are five main modules in the overall architecture: two encoders and two decoders for the source and target languages, and one extra image encoder. Due to the lack of triplet data, we can only build the following two denoised auto-encoding losses without involving the paired x and y,
L_auto(x, z) = SCE( Dec_x(Enc_x(x), Enc_z(z)), x ),    (5.4)
L_auto(y, z) = SCE( Dec_y(Enc_y(y), Enc_z(z)), y ),    (5.5)
where SCE(·, ·) represents the sequential cross-entropy loss. We use a “denoised” loss here because the exact auto-encoding structure would likely force the language model to learn a word-to-word copy network. The image is seemingly redundant, since the text input contains the entire information for recovery. However, it is not guaranteed that our encoder is lossless, so the image is provided as an additional supplement to reduce the information loss.
Cycle-Consistency Loss
The auto-encoding loss can, in theory, learn the two functional mappings X × I → X and Y × I → Y via the supplied training dataset. However, these two mappings are essentially not our desiderata, even though we can switch the two decoders to build our expected mappings, e.g., X × I → Y. The crucial problem is that the transferred mappings obtained after switching decoders lack supervised training, since no regularization pushes the latent encoding spaces to be aligned between the source and target.
We argue that this issue can be tackled by the following two cycle-consistency properties (note that we use the square brackets [·] below to denote the inference mode, meaning no gradient back-propagation through such operations),
Dec_x( Enc_y( Dec_y[Enc_x(x), Enc_z(z)] ), Enc_z(z) ) ≈ x,    (5.6)
Dec_y( Enc_x( Dec_x[Enc_y(y), Enc_z(z)] ), Enc_z(z) ) ≈ y.    (5.7)
The above two properties seem complicated, but we will decompose them step by step to see the intuition, which is also the key to turning the auto-encoders into translation models across different languages. Without loss of generality, we use Property (5.6) as our illustration; the same idea applies to (5.7). After encoding the information from the source and the image into high-level features, the encoded features are fed into the decoder of the other language (i.e., the target language), thus obtaining an inferred target sentence,
~y = F_{xz→y}(x, z) = Dec_y[Enc_x(x), Enc_z(z)].    (5.8)
Unfortunately, the ground truth y corresponding to the input x or z is unknown, so we cannot train F_{xz→y} at this time. However, since x is the golden reference, we can construct the pseudo-supervised triplet (x, ~y, z) as augmented data to train the following model,
F_{yz→x}(~y, z) = Dec_x( Enc_y(~y), Enc_z(z) ).    (5.9)
Note that the pseudo input ~y can be considered a corrupted version of the unknown y. The noisy training step makes sense because injecting noise into the input data is a common trick to improve the robustness of a model even in traditional supervised learning [SHK+14, WPDN18]. Therefore, we incentivize this behavior using the cycle-consistency loss,
L_cyc(x, z) = SCE( F_{yz→x}(F_{xz→y}(x, z), z), x ).    (5.10)
This loss enforces the cycle-consistency property (5.6), and the mapping Y × I → X can be successfully refined.
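The interplay of the auto-encoding losses (5.4)-(5.5) and the cycle-consistency loss (5.10) in one training step can be sketched as follows; every module below is a trivial stub (our placeholder, not the actual transformer/ResNet implementation), so the sketch only shows the data flow:

# Minimal runnable sketch of the two UMNMT losses; all modules are stubs.
def enc_x(x):  return ("feat", x)                     # source-text encoder stub
def enc_y(y):  return ("feat", y)                     # target-text encoder stub
def enc_z(z):  return ("img_feat", z)                 # image encoder stub (ResNet features)
def dec_x(text_feat, img_feat): return text_feat[1]   # source-language decoder stub
def dec_y(text_feat, img_feat): return text_feat[1]   # target-language decoder stub

def corrupt(sentence):
    # Denoising perturbation: word deletion and local permutation (no-op stub here).
    return sentence

def sce(predicted, reference):
    # Stand-in for the sequential cross-entropy loss.
    return float(predicted != reference)

def training_step(x, z_x, y, z_y):
    # Auto-encoding losses, Eqs. (5.4)-(5.5): reconstruct each sentence from its
    # corrupted version plus the paired image.
    l_auto = sce(dec_x(enc_x(corrupt(x)), enc_z(z_x)), x) \
           + sce(dec_y(enc_y(corrupt(y)), enc_z(z_y)), y)
    # Cycle-consistency, Eqs. (5.8)-(5.10): translate in inference mode (no backprop),
    # then translate back and compare with the original sentence.
    y_tilde = dec_y(enc_x(x), enc_z(z_x))
    x_tilde = dec_x(enc_y(y), enc_z(z_y))
    l_cyc = sce(dec_x(enc_y(y_tilde), enc_z(z_x)), x) \
          + sce(dec_y(enc_x(x_tilde), enc_z(z_y)), y)
    return l_auto + l_cyc

print(training_step("a man in an orange hat", "img_en",
                    "un homme avec un chapeau orange", "img_fr"))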
Controllable Attention
In addition to the loss function, another important interaction between the text and image domains lies in the decoder attention module. In general, we propose to extend the traditional encoder-decoder attention to a multi-domain attention,
c_t = Att(h^d_t, h^e) + λ_1 Att(h^d_t, h^i) + λ_2 Att(h^d_t, h^e, h^i),    (5.11)
where λ_1 and λ_2 can be either 1 or 0 during training, depending on whether the fetched batch includes image data or not. For example, we can easily set up a flexible training scheme by alternately feeding monolingual text data and text-image multi-modal data to the model. A nice byproduct of this setup is that it allows us to make a versatile inference with or without the image, which is more applicable to real scenarios.
In practice, we utilize the recently developed self-attention mechanism [VSP+17] as our basic block, whose hidden states contain three sets of vectors Q, K, V, representing queries, keys and values. Therefore, our proposed context vector can be rewritten as
c_t = softmax( Q^d_t (K^e)^T / √d ) V^e + λ_1 softmax( Q^d_t (K^i)^T / √d ) V^i
      + λ_2 softmax( Q^d_t (K^{ei})^T / √d ) V^{ei} + λ_2 softmax( Q^d_t (K^{ie})^T / √d ) V^{ie},    (5.12)
where d is the dimensionality of the keys, [K^{ei}, V^{ei}] = FFN( softmax( Q^e (K^i)^T / √d ) V^i ) denotes the attention from the text input to the image input, and [K^{ie}, V^{ie}] represents the symmetric attention in the reverse direction. Note that Q^e has no time subscript and denotes a matrix, indicating that the softmax is a row-wise operation. In practice, especially for the Multi30K dataset, we found that λ_2 is less important and λ_2 = 0 brings no harm to the performance. Thus, we always set it to 0 in our experiments, but a non-zero λ_2 may be helpful in other cases.
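A minimal NumPy sketch of the controllable attention in Eqs. (5.11)-(5.12) is given below (single head, without the feed-forward or the K^{ei}/V^{ei} co-attention terms, i.e., with λ_2 = 0 as in our experiments; all shapes and names are illustrative assumptions, not the actual model code):

import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def controllable_attention(Q_d, K_e, V_e, K_i=None, V_i=None, lam1=1.0):
    # c_t = softmax(Q_d K_e^T / sqrt(d)) V_e + lam1 * softmax(Q_d K_i^T / sqrt(d)) V_i,
    # where lam1 is switched to 0 when the fetched batch carries no image.
    d = Q_d.shape[-1]
    ctx = softmax(Q_d @ K_e.T / np.sqrt(d)) @ V_e
    if K_i is not None and lam1 != 0.0:
        ctx = ctx + lam1 * softmax(Q_d @ K_i.T / np.sqrt(d)) @ V_i
    return ctx

rng = np.random.default_rng(0)
d, n_tgt, n_src, n_img = 64, 7, 9, 196          # 196 = flattened 14x14 ResNet feature map
Q_d = rng.normal(size=(n_tgt, d))               # decoder queries
K_e, V_e = rng.normal(size=(n_src, d)), rng.normal(size=(n_src, d))   # text keys/values
K_i, V_i = rng.normal(size=(n_img, d)), rng.normal(size=(n_img, d))   # image keys/values

print(controllable_attention(Q_d, K_e, V_e, K_i, V_i, lam1=1.0).shape)  # multi-modal batch
print(controllable_attention(Q_d, K_e, V_e, lam1=0.0).shape)            # text-only batch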
5.2.2 Experiments
Training and Testing on Multi30K
We evaluate our model on the Multi30K [EFSS16b] 2016 test set of the English-French (En-Fr) and English-German (En-De) language pairs. This dataset is a multilingual image caption dataset with 29,000 training samples of images and their annotations in English, German, French [EFB+17b] and Czech [BBS+18]. The validation set and test set have 1,014 and 1,000 samples, respectively. To ensure the model never sees any paired-sentence information (which is an unlikely scenario in practice), we randomly split half of the training and validation sets for one language and use the complementary half for the other. The resulting corpus is denoted as M30k-half, with 14,500 training and 507 validation samples, respectively.
To find out whether the image, as additional information used in the training and/or testing stage, can bring consistent performance improvement, we train our model in two different ways, each with a text-only (-txt) and a text+image (-txt-img) mode. We compare the best-performing training method to the state-of-the-art, and then do a side-by-side comparison between them:
Pre-large (P): To leverage the controllable attention mechanism for exploring the linguistic information in large monolingual corpora, we create a text-only pre-training set by combining the first 10 million sentences of the WMT News Crawl datasets from 2007 to 2017 with 10 copies of M30k-half. This ends up in a large text-only dataset of 10,145,000 non-parallel sentences in each language. P-txt: We then pre-train our
model without the image encoder on this dataset and use the M30k-half validation set
for validation. P-txt-img: Once the text-only model is pre-trained, we then use it for the
following fine-tuning stage on M30k-half. Except for the image encoder, we initialize
our model with the pre-trained model parameters. The image encoder uses pre-trained
ResNet-152 [HZRS16b]. The error gradient does not back-propagate to the original
ResNet network.
Scratch (S): We are also curious about the role the image can play when no pre-training is involved. We train from scratch using text only (S-txt) and text with the corresponding image (S-txt-img) on M30k-half.
Implementation Details and Baseline Models
The text encoder and decoder are both 4-layer transformers with dimensionality 512, and for the related language pair we share the first 3 transformer layers of both the encoder and the decoder. The image encoder is the truncated ResNet-152 with output layer res4b35_relu, and the parameters of the ResNet are frozen during model optimization. In particular, the 14 × 14 × 1024 feature map of layer res4b35_relu is flattened to 196 × 1024 so that its dimension is consistent with the sequential text encoder output. The actual losses (5.4) and (5.5) follow standard denoising auto-encoders: the text input is perturbed with deletion and local permutation; the image input is corrupted via dropout. We use the same word preprocessing techniques (Moses tokenization, BPE, binarization, fastText word embeddings on the training corpora, etc.) as reported in [LOC+18]; please refer to the relevant readings for further details.
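A minimal sketch of the text perturbation mentioned above (word deletion plus local permutation); the drop probability and the permutation window are assumed values, not the exact settings used in training:

import random

def add_noise(tokens, p_drop=0.1, k=3, seed=None):
    # Randomly delete words, then locally shuffle: each surviving word may move
    # at most about k positions away from where it started.
    rng = random.Random(seed)
    kept = [w for w in tokens if rng.random() > p_drop] or tokens[:1]
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]   # jittered positions
    return [w for _, w in sorted(zip(keys, kept))]

print(add_noise("a man in an orange hat starring at something".split(), seed=0))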
We compare the proposed UMNMT model to the following UMT models.
MUSE [CLR+18]: an unsupervised word-to-word translation model. The embedding matrix is trained on large-scale wiki corpora.
Game-NMT [YLL18]: a multi-modal zero-resource UMT method trained using reinforcement learning.
UNMT-text [LCDR18]: a mono-modal UMT model which only utilizes text data; it is pretrained on synthetic paired data generated by MUSE.
Models        En→Fr   Fr→En   En→De   De→En
MUSE 8.54 16.77 15.72 5.39
Game-NMT - - 16.6 19.6
UNMT-text 32.76 32.07 22.74 26.26
S-txt 6.01 6.75 6.27 6.81
S-txt-img 9.40 10.04 8.85 9.97
P-txt 37.20 38.51 20.97 25.00
P-txt-img 39.79 40.53 23.52 26.39
Table 5.1: BLEU benchmarking. The numbers of baseline models are extracted from
the corresponding references.
            En→Fr                  Fr→En                  En→De                  De→En
Models Meteor Rouge CIDEr Meteor Rouge CIDEr Meteor Rouge CIDEr Meteor Rouge CIDEr
S-txt 0.137 0.325 0.46 0.131 0.358 0.48 0.116 0.306 0.35 0.128 0.347 0.47
S-txt-img 0.149 0.351 0.65 0.155 0.401 0.75 0.138 0.342 0.59 0.156 0.391 0.70
P-txt 0.337 0.652 3.36 0.364 0.689 3.41 0.254 0.539 1.99 0.284 0.585 2.20
P-txt-img 0.355 0.673 3.65 0.372 0.699 3.61 0.261 0.551 2.13 0.297 0.597 2.36
Table 5.2: UMNMT shows consistent improvement over text-only model across nor-
malized Meteor, Rouge and CIDEr metrics.
Benchmarking with state-of-the-art
In this section, we report the widely used BLEU scores on the test dataset in Table 5.1 for different MT models. Our best model achieves state-of-the-art performance, leading the second best by more than 6 points on the En→Fr task. Some translation examples are shown in Figure 5.4. There is also close to a 1-point improvement on the En→De task. Although pre-training plays a significant role in the final performance, the image also contributes more than 3 points when training from scratch (S-txt vs. S-txt-img), and around 2 points in the fine-tuning case (P-txt vs. P-txt-img). Interestingly, it is observed that the image contributes less performance improvement for pre-training than for training from scratch. This suggests that there is a certain information overlap between the large monolingual corpus and the M30k-half images. We also compare the Meteor, Rouge,
CIDEr scores in Table 5.2 and the validation BLEU in Figure 5.3 to show the consistent improvement brought by using images.
Figure 5.3: Validation BLEU comparison between text-only and text+image.
Analysis
In this section, we shed more light on how and why images can help unsupervised MT. We first visualize which parts of the input image help the translation by showing the heat map of the transformer attention. We then show that the image not only helps the translation by providing more information in the testing stage, but can also act as a training regularizer, guiding the model to converge to a better local optimum in the training stage.
GT: un homme avec un chapeau orange regardant quelque chose (a man in an orange hat starring at something)
P-txt: un homme en orange maettant quelque chose au loin (a man in orange putting something off)
P-txt-img: un homme en chapeau orange en train de filmer quelque chose (a man in an orange hat filming something)
GT: une femme en t-shirt bleu et short blanc jouant au tennis (a woman in a blue shirt and white shorts playing tennis)
P-txt: une femme en t-shirt bleu et short blanc jouant au tennis (a woman in blue t-shirt and white shorts playing tennis)
P-txt-img: une femme en t-shirt bleu et short blanc jouant au tennis (a woman in blue t-shirt and white shorts playing tennis)
GT: un chien brun ramasse une brindille sur un revêtement en pierre (a brown dog picks up a twig from stone surface)
P-txt: un chien marron retrouve un twig de pierre de la surface (a brown dog finds a twig of stone from the surface)
P-txt-img: un chien brun accède à la surface d'un étang (a brown dog reaches the surface of a pond)
GT: un garçon saisit sa jambe tandis il saute en air (a boy grabs his leg as he jumps in the air)
P-txt: un garçon se met à sa jambe devant lui (a boy puts his leg in front of him)
P-txt-img: un garçon installe sa jambe tandis il saute en air (a boy installs his leg while he jumps in the air)
Figure 5.4: Translation results from different models (GT: ground truth).
Attention
To visualize the transformer's attention from regions in the input image to each word in the translated sentence, we use the scaled dot-product attention of the transformer decoder's multi-head attention block shown in Figure 5.2; more specifically, it is softmax( Q^d_t (K^i)^T / √d ). This is a matrix of shape l_T × l_S, where l_T is the translated sentence length and l_S is the source length. Since we flatten the 14 × 14 feature map from ResNet-152, l_S = 196. A heat map for the j-th word in the translation is then generated by mapping the value of the k-th entry in {c_i[j, k]}_{k=1}^{196} to its receptive field in the original image, averaging the values in the overlapping areas and then low-pass filtering. Given this heat map, we visualize it in two ways: (1) we overlay the contour of the heat map with the original image, as shown in the second and fifth rows of Figure 5.5 and the second
row of Figure 5.6; (2) we normalize the heat map between 0 and 1, and then multiply it with each color channel of the input image pixel-wise, as shown in the third and sixth rows of Figure 5.5 and in the third row of Figure 5.6. We visualize the text attention by simply plotting the text attention matrix softmax( Q^d_t (K^e)^T / √d ) in each transformer decoder layer, as shown in “Text decoder attention by layers” in these two figures.
Figure 5.5: Correct attention for {“homme”, “chapeau”, “orange”, “chose”} (source: “a man in an orange hat starring at something”) and {“bleu”, “t-shirt”, “blanc”, “short”} (source: “a woman in a blue shirt and white shorts playing tennis”).
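A sketch of the heat-map construction just described (ours; it assumes a 224 × 224 input image, reshapes one attention row to the 14 × 14 grid and upsamples it by nearest-neighbor replication of each receptive field, leaving out the overlap averaging and low-pass filtering):

import numpy as np

def word_heatmap(attn_row, image_hw=(224, 224), grid=14):
    # attn_row: the 196 image-attention weights for one translated word.
    heat = attn_row.reshape(grid, grid)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)   # normalize to [0, 1]
    scale_h, scale_w = image_hw[0] // grid, image_hw[1] // grid
    # Nearest-neighbor upsampling: each grid cell covers its receptive field.
    return np.kron(heat, np.ones((scale_h, scale_w)))

attn_row = np.random.default_rng(0).random(196)
mask = word_heatmap(attn_row)
print(mask.shape)            # (224, 224); multiply per color channel to highlight the image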
Figure 5.5 shows two positive examples in which the transformer attends to the right regions of the image, like “orange”, “chapeau”, or “homme” (interestingly, the nose) in the upper image, or “bleu”, “t-shirt”, “blanc” or “short” in the lower
image. Whereas in Figure 5.6, the transformer attends to the whole image and treats it as a pond instead of focusing on the region where the twig is. As a result, the twig was mistaken for a pond. For the text attention, we can see that the text heat map becomes more and more diagonal as the decoder layer goes deeper in both figures. This indicates that the text attention gets more and more focused, since English and French have similar grammatical rules.
Figure 5.6: Correct attention for {“chien”, “brun”, “accède”, “surface”} (source: “a brown dog picks up a twig from stone surface”), but “twig” was missed for “étang”.
Table 5.3: BLEU for testing with TEXT-ONLY input.
Models      En→Fr   Fr→En   En→De   De→En
S-txt       6.01    6.75    6.27    6.81
S-txt-img   7.55    7.66    7.70    7.53
P-txt       37.20   38.51   20.97   25.00
P-txt-img   39.44   40.30   23.18   25.47
Generalizability
As shown in Equation 1.1, the model would certainly get more information when the image is present in the inference stage, but can images be helpful if they are used in the training stage but not readily available during inference (which is a very likely scenario in practice)? Table 5.3 shows that even when images are not used, the performance degradation is not that significant (refer to Rows 6-8 in Table 5.1 for comparison), and the model trained with images still outperforms the model trained with text only by quite a margin. This suggests that images can
serve as additional information in the training process, thus guiding the model to converge to a better local optimum. Such findings also verify the proposed controllable attention mechanism. This indicates that the requirement of a paired image and monolingual text in the testing stage can be relaxed to feeding text-only data if paired images are not available.
Uncertainty Reduction
To show that images help MT by aligning different languages with similar meanings, we also train the UMNMT model on the whole Multi30K dataset, where the source and target sentences are treated as unpaired (i.e., we still feed the image-text pairs to the model). By doing this, we greatly increase the number of sentences in different languages with similar meanings; if images can help align those sentences, then the model should be able to learn better than the model trained with text only. We can see from Table 5.4 that the performance increase from using images far outstrips that of the model trained on text-only data; in the case of En→Fr, P-txt-img gains more than 4 points over P-txt.
Table 5.4: BLEU increase of the UMNMT model trained on the full Multi30K over the UMNMT model trained on M30k-half (Table 5.1, Rows 5-8).
Models      En→Fr   Fr→En   En→De   De→En
S-txt       13.26   11.37   4.15    6.14
S-txt-img   16.10   13.30   6.40    7.91
P-txt       1.19    1.70    1.39    2.00
P-txt-img   5.52    2.46    1.72    3.12
5.3 Conclusion
In this work, we proposed a new unsupervised NMT model with multi-modal attention (one for text and one for image) which is trained under an auto-encoding and cycle-consistency paradigm. Our experiments showed that images, as additional information, can significantly and consistently improve UMT performance. This justifies our hypothesis that utilizing multi-modal data can increase the mutual information between the source sentences and the translated target sentences. We also showed that the UMNMT model trained with images can reach a better local optimum and can still achieve better performance than the text-only model even if images are not available in the testing stage. Overall, our work makes unsupervised machine translation more applicable to real scenarios.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
“A search space Odyssey”: German computer scientist Jürgen Schmidhuber, one of the founding fathers of LSTM, once used this term to describe the world of RNNs, and used it in the title of his paper about LSTM's learning behavior. The same term can elegantly summarize our struggle in “finding memory in time”: the unconquered land of understanding the sequence learning systems we have already created is full of surprises and serendipities; there are many directions we can go, yet each direction awaits exploration, and thus each direction is an Odyssey in itself.
In this Odyssey, we try to offer perspectives and lay the foundation for addressing the following questions: what is memory, and how can we build a system that learns efficiently by remembering? Can visual imagination help and, if so, how can we build a system that handles both language and vision? The foundation we built for the former question consists of two computing architectures: ELSTM for RNN modeling and TMPCA for DR, derived from the perspectives of memory as a system function and memory as DR, respectively. From the first perspective, we did a detailed analysis of RNN cell models, demystified their model properties, and found the downsides of their designs. Our new ELSTM proposal, which stems from this analysis, demonstrates outstanding performance on complex language tasks. From the second perspective, we “sailed” through the world of DR and found the merit of PCA
for sequence learning systems by extending it to a multi-stage architecture which com-
putes much faster than ordinary PCA while retaining much of the other merits. To
answer the latter question, we argued that visual information can benefit the language
learning task by increasing the system’s mutual information and successfully deployed a
Transformer-based multi-modal NMT system that is trained/fine-tuned unsupervisedly
on image captioning dataset. It is one of the first such systems ever developed for unsu-
pervised MT and the new UMNMT system for the first time shows that a multi-modal
solution can outperform the text-only ones significantly.
6.2 Future Work
To continue the investigation of neural multi-modal systems, we would first like to isolate the error contributions of the CNN and the RNN in the hybrid network, understand the mechanism behind the phenomenon, and then find possible improvements. We carry out such an investigation on the image captioning problem as shown in Fig. 6.1.
The conjecture of the experiment is: if the output of the CNN or RNN is not uniquely representative of the ground truth class, say A, and it is confused with another class, say B, then the output point of the CNN or RNN in the classification space should be closer to the semantic point of class B than to that of A. Here the semantic point is the point in space that occupies the centroid of the representative area of that class. To find such a point for each class in the CNN classification space and in the RNN classification space, we calculate the mean of the CNN/RNN output feature vectors given input images of the same class in the training dataset. Here, the classes are the nouns, verbs, adverbs, and adjectives in the captions. Once the semantic points of all classes are prepared, we then do inference on the development dataset with known ground truth captions. For each input image, we will have a network-generated caption
and the ground truth caption. We then choose the nouns, verbs, adverbs, and adjectives from those two captions: the ones that exist only in the generated caption are treated as wrong classes B, and the ones that appear only in the ground truth caption are treated as classes A. For example, suppose the generated caption has the nouns “bird”, “tree”, and “flower” and the ground truth caption has the nouns “swallow”, “grass”, “flower” and “sky”; then the nouns used for comparison are:
Classes A (ground truth nouns): swallow, grass, sky
Classes B (generated nouns): bird, tree
Figure 6.1: Image captioning errors.
(a) Output: “a bathroom with a sink, toilet, and mirror”. Ground truth: “The bathroom has double sinks and there is a television mounted”.
(b) Output: “a group of people sitting on a bench”. Ground truth: “A man on a bench is covered with birds”.
(c) Output: “a bunch of umbrellas that are in the grass”. Ground truth: “Parasols of different colors hanging from tall trees”.
(d) Output: “a hot dog and french fries are on a plate”. Ground truth: “A sandwich with a large pickle sitting on top”.
(e) Output: “a table topped with lots of plates of food”. Ground truth: “A table full of food such as peas and carrots, bread, salad and gravy”.
(f) Output: “a close up of a pizza on a plate”. Ground truth: “A pizza is center loaded with toppings and cheese”.
The CNN output point is chosen as the feature vector from its last fully connected layer, and the RNN output point is the hidden state generated at the end of the caption. Then, for the CNN, we calculate the L2 distance between its output point and the semantic point of each class in A and in B. We then calculate an averaged distance for category A and for category B, respectively, given the current input image; we say that the CNN is “correct” for the current input if the averaged distance to A is shorter than that to B, and “wrong” otherwise. We do the same for the RNN (a sketch of this procedure is given below).
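The comparison procedure above can be sketched as follows (ours; the feature dimension, class names, and centroid values are placeholders, with `centroids` standing for the per-class mean features computed on the training set):

import numpy as np

def attribute(feature, classes_a, classes_b, centroids):
    # Average L2 distance from the module's output feature to the semantic points
    # (class centroids) of ground-truth-only words (A) and of generated-only words (B);
    # the module is judged "correct" if A is closer on average.
    dist = lambda cls: np.mean([np.linalg.norm(feature - centroids[c]) for c in cls])
    return "correct" if dist(classes_a) < dist(classes_b) else "wrong"

rng = np.random.default_rng(0)
centroids = {w: rng.normal(size=128) for w in
             ["swallow", "grass", "sky", "bird", "tree", "flower"]}
cnn_feature = rng.normal(size=128)            # e.g. last fully-connected-layer output
print(attribute(cnn_feature,
                classes_a=["swallow", "grass", "sky"],     # ground-truth-only words
                classes_b=["bird", "tree"],                # generated-only words
                centroids=centroids))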
The hybrid network we used for this investigation is the one in [VTBE15]; we used its default settings and ran 1 million training steps. The results are shown in Table 6.1; the conclusion is that the CNN contributes the most to the error.
Table 6.1: Error Sources
CNN Results RNN Results Case Frequency
Case 1 Wrong Wrong 0.62
Case 2 Correct Wrong 0.09
Case 3 Wrong Correct 0.09
Although it is widely believed that the CNN is responsible for image processing and the RNN part is responsible for language processing, there is no evidence substantiating this claim. On the contrary, the very fact that in practice the CNN used in a hybrid network is mostly pre-trained suggests otherwise. In practice, the CNN module is mostly pre-trained on other tasks (e.g., trained on ImageNet [DDS+09] for image classification) and is then integrated with a randomly initialized RNN. During the training process, some hybrid networks do not back-propagate the error to the CNN. By doing this, the functionality of the CNN module is ensured to be image feature extraction. In other network designs, the CNN is fine-tuned with the error back-propagated to it. Were the CNN responsible solely for the image part, training a CNN in a hybrid network should be the same as training it separately, and a pre-trained CNN would not be necessary. Using a pre-trained CNN is rather a compromise, since a hybrid network is very difficult to train from scratch. The problem with using a pre-trained CNN model is that a CNN trained on other tasks usually over-fits to the training dataset of those tasks. For example, a CNN trained on ImageNet to assign an image to one of 1000 classes has difficulty recognizing the features of objects outside those classes. Such a problem is well known in transfer learning, i.e., using a model trained on other tasks. Fig. 6.2 shows an example where the error of mis-detecting “trees” as “grass” comes from the CNN.
Figure 6.2: Generated: a bunch of umbrellas that are in the grass Ground Truth:
Parasols of different colors hanging from tall trees
The ranking of distance shown in the above example is:
Table 6.2: Ranking of Distance: Ascending Order
CNN(class) RNN(class)
umbrellas(B) umbrellas(B)
grass(B) grass(B)
bunch(B) bunch(B)
trees(A) trees(A)
In some other cases, the RNN and beam search introduce more errors:
Figure 6.3: Generated: a man and a woman sitting on a bench Ground Truth: Man
riding double bicycle stopped on road reading cell phone
Table 6.3: Ranking of Distance: Ascending Order
CNN(class) RNN(class)
bicycle(A) bench(B)
road(A) bicycle(A)
man(B) road(A)
woman(B) man(B)
bench(B) woman(B)
phone(A) cell(A)
cell(A) phone(A)
Beam search is a searching technique that returns the sequence with the greatest joint probability. In the above case, “bench” is somehow mis-detected by the RNN, and consequently, the joint probability of “man”, “woman” and “bench” is increased.
How to improve the hybrid network based on this understanding requires further
investigation.
Appendices
A Chapter 3 : More Results
We compare the training perplexity between I_t = X_t and I_t = [X_t^T, h_{t-1}^T]^T for various models with the LSTM and ELSTM cells in Fig. A.1.
Figure A.1: The training perplexity vs. training steps of different models with I_t = X_t and I_t = [X_t^T, h_{t-1}^T]^T for the POS tagging task.
B Chapter 4 : Derivation of System Function
We use the same notation as in Sec. 4.2. For stage s > 1, we have:
z^s_j = U^s [z^{s-1}_{2j-1}; z^{s-1}_{2j}] = U^s_1 z^{s-1}_{2j-1} + U^s_2 z^{s-1}_{2j},    (1)
where j = 1, ..., N/2^s. When s = 1, we have
z^1_j = U^1_1 w_{2j-1} + U^1_2 w_{2j}.    (2)
From Eqs. (1) and (2), we get
Y = z^L_1 = Σ_{j=1}^{N} ( Π_{s=1}^{L} U^s_{f_{j,s}} ) w_j,    (3)
f_{j,s} = b_L(j-1)_s + 1,    (4)
where b_L(x)_s is the s-th digit of the binarization of x of length L. Eq. (3) can be further simplified to Eq. (4.6). For example, if N = 8, we obtain
Y = U^3_1 U^2_1 U^1_1 w_1 + U^3_1 U^2_1 U^1_2 w_2 + U^3_1 U^2_2 U^1_1 w_3 + U^3_1 U^2_2 U^1_2 w_4
  + U^3_2 U^2_1 U^1_1 w_5 + U^3_2 U^2_1 U^1_2 w_6 + U^3_2 U^2_2 U^1_1 w_7 + U^3_2 U^2_2 U^1_2 w_8.    (5)
The superscripts of U^s_j are arranged in the stage order L, L-1, ..., 1. The subscripts are shown in Table B.1. This is the reason that binarization is required to express the subscripts in Eqs. (4.6) and (3).
Table B.1: Subscripts of U^s_j.
w_1     w_2     w_3     w_4     w_5     w_6     w_7     w_8
1,1,1   1,1,2   1,2,1   1,2,2   2,1,1   2,1,2   2,2,1   2,2,2
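The subscript pattern in Table B.1 follows directly from Eq. (4); a short sketch (ours) reproduces it for N = 8, L = 3:

def subscripts(j, L):
    # f_{j,s} = b_L(j-1)_s + 1; the digits are listed in the stage order
    # L, L-1, ..., 1, as displayed in Table B.1.
    bits = format(j - 1, f"0{L}b")          # L-bit binarization, most-significant bit first
    return [int(b) + 1 for b in bits]

L = 3
for j in range(1, 2 ** L + 1):
    print(f"w_{j}:", ",".join(map(str, subscripts(j, L))))
# w_1: 1,1,1   w_2: 1,1,2   ...   w_8: 2,2,2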
C Chapter 5 : More Visualization Results
Bibliography
[ABA+16] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic,
Ivan Laptev, and Simon Lacoste-Julien. Unsupervised learning from nar-
rated instruction videos. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 4575–4583, 2016.
[ACPSRI17] Oscar Araque, Ignacio Corcuera-Platas, J. Fernando Sánchez-Rada, and
Carlos A. Iglesias. Enhancing deep learning sentiment analysis with
ensemble techniques in social applications. Expert Systems with Applica-
tions, 77:236–246, Jul 2017.
[AHY11] T. A. Almeida, José María Gómez Hidalgo, and J. M. Yamakami.
Contributions to the study of sms spam filtering: New collection
and results. https://archive.ics.uci.edu/ml/datasets/
sms+spam+collection, 2011.
[ALAC18] Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsu-
pervised neural machine translation. In International Conference on
Learning Representations (ICLR), 2018.
[AWT+18] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel.
Vision-and-language navigation: Interpreting visually-grounded naviga-
tion instructions in real environments. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR) 2018, pages
3674–3683, 2018.
[BBS+18] Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond
Elliott, and Stella Frank. Findings of the third shared task on multimodal
machine translation. In Proceedings of the Third Conference on Machine
Translation: Shared Task Papers, pages 304–323, 2018.
[BCB14] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural
machine translation by jointly learning to align and translate. arXiv
preprint arXiv:1409.0473, 2014.
[BCB15] Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural
machine translation by jointly learning to align and translate. In Pro-
ceedings of The International Conference on Learning Representations,
2015.
[BDVJ03] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin.
A neural probabilistic language model. Journal of Machine Learning
Research, pages 1137–1155, 2003.
[BGC17] Lorenzo Baraldi, Costantino Grarna, and Cucchiara. Hierarchical
boundary-aware neural encoder for video captioning. In Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 3185–3194, Jul. 2017.
[BH89] Pierre Baldi and Kurt Hornik. Neural networks and principal component
analysis: Learning from examples without local minima. Neural Net-
works, 2(1):53–58, 1989.
[BHS15] Mohamed Bennasar, Yulia Hicks, and Rossitza Setchi. Feature selection
using joint mutual information maximisation. Expert Systems with Appli-
cations, 42(22):8520–8532, 2015.
[BKK18] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Convolutional sequence
modeling revisited. In International Conference on Learning Represen-
tations (ICLR), 2018.
[BL05] P. Bo and Lillian Lee. sentence polarity dataset. http://www.cs.
cornell.edu/people/pabo/movie-review-data/, 2005.
[BSF94] Yoshua Bengio, Patrice Simard, and Paolo Frasoni. Learning long-term
dependencies with gradient descent is difficult. Neural Networks, 5:157–
166, 1994.
[CC14] Danqi Chen and Manning Christopher. A fast and accurate dependency
parser using neural networks. In In Proceedings of The Empirical Meth-
ods in Natural Language Processing (EMNLP 2014), pages 740–750,
2014.
[CFH
16] Hao Cheng, Hao Fang, Xiaodong He, Jianfeng Gao, and Li Deng. Bi-
directional attention with agreement for dependency parsing. In Proceed-
ings of The Empirical Methods in Natural Language Processing (EMNLP
2016), 2016.
[CGCB14] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Ben-
gio. Empirical evaluation of gated recurrent neural networks on sequence
modeling. arXiv preprint, (arXiv:1412.3555), Dec 2014.
127
[CHK
18] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and
Richard Bowden. Neural sign language translation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages
7784–7793, 2018.
[CLC17] Lacer Calixto, Qun Liu, and Nick Campbell. Doubly-attentive decoder
for multi-modal neural machine translation. In Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics 2017,
pages 1913–1924, 2017.
[CLR
18] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic
Denoyer, and Herv´ e J´ egou. Word translation without parallel data. In
International Conference on Learning Representations (ICLR), 2018.
[CMG
14] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using rnn encoder-decoder for statistical machine
translation. Proc. EMNLP’2014, 2014.
[CXHW17] Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. Improving sentiment
analysis via sentence type classification using bilstm-crf and cnn. Expert
Systems with Applications, 72:221–230, Apr 2017.
[CZLZ16] Kewen Chen, Zuping Zhang, Jun Long, and Hao Zhang. Turning from
tf-idf to tf-igm for term weighting in text classification. Expert Systems
with Applications, 66:245–260, Dec 2016.
[DBFF02] P. Duygulu, K. Barnard, J.F.G. de Freitas, and D.A. Forsyth. Object
recognition as machine translation: Learning a lexicon for a fixed image
vocabulary. In Proceedings of the European conference on computer
vision, pages 97–112, Berlin, 2002. Springer.
[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
Bert: Pre-training of deep bidirectional transformers for language under-
standing. arXiv preprint arXiv:1810.04805, 2018.
[DDF
90] Scott Deerwester, T. Susan Dumais, W. George Furnas, K. Thomas Lan-
dauer, and Richard Harshman. Indexing by latent semantic analysis. Jour-
nal of the American society for information science, 41:391, 1990.
[DDS
09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li.
Imagenet: A large-scale hierarchical image database. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages
248–255. IEEE, 2009.
128
[Duc11] Duchi. Adaptive subgradient methods for online learning and stochastic
optimization. The Journal of Machine Learning Research, pages 2121–
2159, 2011.
[EFB
17a] Desmond Elliott, Stella Frank, Lo¨ ıc Barrault, Fethi Bougares, and
Lucia Specia. Findings of the second shared task on multimodal
machine translation and multilingual image description. arXiv preprint
arXiv:1710.07177, 2017.
[EFB
17b] Desmond Elliott, Stella Frank, Lo¨ ıc Barrault, Fethi Bougares, and Lucia
Specia. Findings of the second shared task on multimodal machine trans-
lation and multilingual image description. In Proceedings of the Sec-
ond Conference on Machine Translation, Volume 2: Shared Task Papers,
pages 215–233, Copenhagen, Denmark, September 2017. Association for
Computational Linguistics.
[EFSS16a] Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia.
Multi30k: Multilingual english-german image descriptions. In Proceed-
ings of the 5th Workshop on Vision and Language, 2016.
[EFSS16b] Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia.
Multi30k: Multilingual english-german image descriptions. In Proceed-
ings of the 5th Workshop on Vision and Language, pages 70–74. Associ-
ation for Computational Linguistics, 2016.
[EK17] Desmond Elliott and K´ ad´ ar. Imagination improves multimodal transla-
tion. In Proceedings of the 8th International Joint Conference on Natural
Language Processing, pages 130–141, Taipei, 2017.
[Elm90] Jeffrey Elman. Finding structure in time. Cognitive Science, 14:179–211,
1990.
[FDM97] Nir Friedman, Geiger Dan, and Goldszmidt Moises. Bayesian network
classifiers. Machine learning, 29:131–163, 1997.
[FHC
18] Daniel Fried, Ronghang Hu, V olkan Cirik, Anna Rohrbach, Jacob
Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate
Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for
vision-and-language navigation. In Advances in Neural Information Pro-
cessing Systems (NIPS) 2018, pages 3318–3329, 2018.
[GAMY
17] J. Gehring, Grangier Auli M, D. Yarats, Yarats Denis, and Dauphin
Yann N. Convolutional sequence to sequence learning. In arXiv preprint,
number 1705.03122, May 2017.
129
[GM18] Satya Krishna Gorti and Jeremy Ma. Text-to-image-to-text transla-
tion using cycle consistent adversarial networks. arxiv preprint, page
arXiv:1808.04538, 2018.
[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech
recognition with deep recurrent neural networks. In Acoustics, speech
and signal processing (icassp), 2013 ieee international conference on.
IEEE, pages 6645–6649, May 2013.
[GSC00] F. A. Gers, Jurgen Schmidhuber, and Fred Cummins. Learning to forget:
Continual prediction with lstm. Neural Computation, pages 2451–2471,
2000.
[GSK
17] Klaus Greff, Rupesh K. Srivastava, Jan Koutnik, Bas R. Steunebrink, and
Jurgen Schmidhuber. Lstm: A search space odyssey. IEEE transactions
on neural networks and learning systems, 28(10):2222–2232, Oct 2017.
[GSZ13] Manoochehr Ghiassi, James Skinner, and David Zimbra. Twitter brand
sentiment analysis: A hybrid system using n-gram analysis and dynamic
artificial neural network. Expert Systems with Applications, 40(16):6266–
6282, 2013.
[HS97] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory.
Neural Computation, 9:1735–1780, 1997.
[HSFA18] Arslan Hasan Sait, Mark Fishel, and Gholamreza Anbarjafari. Dou-
bly attentive transformer machine translation. arxiv preprint, page
arXiv:1807.11605, 2018.
[HXQ
16] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and
Wei-Ying Ma. Dual learning for machine translation. In Advances in
Neural Information Processing Systems, pages 820–828, 2016.
[HZRS16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
IEEE, 2016.
[HZRS16b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 770–778, 2016.
[IR10] Alexander Ilin and Tapani Raiko. Practical approaches to principal com-
ponent analysis in the presence of missing values. Journal of Machine
Learning Research, pages 1957–2000, 2010.
130
[JGBM17] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov.
Bag of tricks for efficient text classification. In European Chapter of the
Association for Computational Linguistics (EACL), 15th Conference of,
volume 1, pages 427–431, 2017.
[Joa98] Thorsten Joachims. Text categorization with support vector machines:
Learning with many relevant features. In European Conference on
machine learning, pages 137–142, Apr 1998.
[Jor97] Michael Jordan. Serial order: A parallel distributed processing approach.
Advances in Psychology, 121:471–495, 1997.
[KAS14] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions:
Recovering the syntax and semantics of goal-directed human activities.
In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 780–787, 2014.
[KB15] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic opti-
mization. In Proceedings of the 3th International Conference on Learning
Representations (ICLR). ICLR, 2015.
[KBDB13] Efstratios Kontopoulos, Christos Berberidis, Theologos Dergiades, and
Nick Bassiliades. Ontology-based sentiment analysis of twitter posts.
Expert Systems with Applications, 40(10):4065–4074, 2013.
[KFF15] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for
generating image descriptions. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 3128–3137, 2015.
[KJL15] Andrej Karpathy, Justin Johnson, and Fei-Fei Li. Visualizing and under-
standing recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
[LCDR18] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and
Marc’Aurelio Ranzato. Unsupervised machine translation using
monolingual corpora only. In International Conference on Learning
Representations (ICLR), 2018.
[LHM18] Jind˘ rich Libovick´ y, Jind˘ rich Helcl, and David Mare˘ cek. Input combina-
tion strategies for multi-source transformer decoder. In Proceedings of
the Third Conference on Machine Translation (WMT) 2018, volume 1,
pages 253–260, 2018.
[Lin88] Ralph Linsker. Self-organization in a perceptual network. Computer,
21(3):105–117, Mar 1988.
131
[LLXZ16] Siwei Lai, Kang Liu, Liheng Xu, and Jun Zhao. How to generate a good
word embedding. IEEE Intelligent Systems, 31:5–14, 2016.
[LOC
18] Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and
Marc’Aurelio Ranzato. Phrase-based & neural unsupervised machine
translation. In Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing (EMNLP), 2018.
[MLW
19] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AIRegib, Zsolt Kira,
Richard Socher, and Caiming Xiong. Self-monitoring navigation agent
via auxiliary progress estimation. In International Conference on Learn-
ing Representations (ICLR) 2019, 2019.
[MLY18] W. James Murdoch, Peter J. Liu, and Bin Yu. Beyond word importance:
Contextual decomposition to extract interactions from lstms. In Interna-
tional Conference on Learning Representations (ICLR), 2018.
[MP18] Marcin Michal Mir´ nczuk and Jaroslaw Protasiewicz. A recent overview
of the state-of-the-art elements of text classification. Expert Systems with
Applications, 106:36–54, Sep 2018.
[MPBS16] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhut-
dinov. Generating images from captions with attention. In International
Conference on Learning Representations (ICLR), 2016.
[MSC
13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey
Dean. Distributed representations of words and phrases and their compo-
sitionality. In Advances in neural information processing systems, pages
3111–3119, 2013.
[MSM93] Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz.
Building a large annotated corpus of english: the penn treebank. Compu-
tational Linguistics, 19:313–330, 1993.
[MVN13] Rodrigo Moraes, Francisco Valiati, Jo¯ aO, and Wilson P. Gavi¯ aO Neto.
Document-level sentiment classification: An empirical comparison
between svm and ann. Expert Systems with Applications, 40(2):621–633,
2013.
[MXY
17] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. Explain
images with multimodal recurrent neural networks. arXiv preprint, page
arXiv:1410.1090, 2017.
[Nes83] Yurii Nesterov. A method of solving a convex programming problem with
convergence rate o (1/k2). In Soviet Mathematics Doklady, volume 27,
pages 372–376, 1983.
132
[NN17] Hideki Nakayama and Noriki Nishida. Zero-resource machine trans-
lation by multimodal encoder–decoder network with multimedia pivot.
Machine Translation, 31(1-2):49–64, 2017.
[PGCB14] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Ben-
gio. How to construct deep recurrent neural networks. arXiv:1312.6026,
2014.
[RAY
16] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt
Schiele, and Honglak Lee. Generative adversarial text to image synthe-
sis. In Proceedings of the 33rd International Conference on International
Conference on Machine Learning-Volume 48, pages 1060–1069. JMLR.
org, 2016.
[RMB13] Pascanu Razvan, Tomas Mikolov, and Yoshua Bengio. On the difficulty
of training recurrent neural networks. In Proceedings of The International
Conference on Machine Learning (ICML 2013), pages 1310–1318, 2013.
[SB88] Gerard Salton and Christopher Buckley. Term-weighting approaches
in automatic text retrieval. Information processing & management,
24(5):513–523, Jan 1988.
[SFB
19] Yuanhang Su, Kai Fan, Nguyen Bach, C.-C. Jay Kuo, and Fei Huang.
Unsupervised multi-modal neural machine translation. In Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
2019, Jun 2019.
[SFSE16] Lucia Specia, Stella Frank, Khalil Sima’an, and Desmond Elliott. A
shared task on multimodal machine translation and crosslingual image
description. In Proceedings of the First Conference on Machine Transla-
tion, volume 2, pages 543–553. IEEE, 2016.
[SHK
14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: a simple way to prevent neural net-
works from overfitting. The Journal of Machine Learning Research,
15(1):1929–1958, 2014.
[SHK18] Yuanhang Su, Yuzhong Huang, and Jay C.-C. Kuo. Efficient text classi-
fication using tree-structured multi-linear principal component analysis.
In Pattern recognition (ICPR), 2018 International conference on. IEEE,
Apr 2018.
[SK15] Yuanhang Su and C.-C. Jay Kuo. Fast and robust camera’s auto exposure
control using convex or concave model. In Proceedings of IEEE Confer-
ence on Consumer Electronics (ICCE) 2015, pages 13–14, Jan 2015.
133
[SK18] Yuanhang Su and C.-C. Jay Kuo. On extended long short-term mem-
ory and dependent bidirectional recurrent neural network. arXiv,
(arXv:1803.01686), 2018.
[SLK16] Yuanhang Su, Joe Yuchieh Lin, and Jay C.-C. Kuo. A model-based
approach to camera’s auto exposure control. Journal of Visual Communi-
cation and Image Representation, pages 122–129, Apr 2016.
[SLK19] Yuanhang Su, Ruiyuan Lin, and C.-C. Jay Kuo. Tree-structured multi-
stage principal component analysis (tmpca): theory and applications.
Expert Systems with Applications, pages 355–364, Mar 2019.
[SP97] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural net-
works. Signal Processing, 45:2673–2681, 1997.
[SPW
13] R. Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Man-
ning, Andrew Ng, and Christopher Potts. Recursive deep models for
semantic compositionality over a sentiment treebank. https://nlp.
stanford.edu/sentiment/, 2013.
[Staa] Standford. Corenlp. https://stanfordnlp.github.io/
CoreNLP/.
[Stab] Nick Statt. Google’s ai translation system is
approaching human-level accuracy. https://
www.theverge.com/2016/9/27/13078138/
google-translate-ai-machine-learning-gnmt.
[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V . Le. Sequence to sequence
learning with neural networks. Advances in Neural Information Process-
ing Systems, pages 3104–3112, 2014.
[TMS
16] Joji Toyama, Masanori Misono, Masahiro Suzuki, Kotaro Nakayama, and
Yutaka Matsuo. Neural machine translation with latent semantic of image
and text. arXiv preprint, page arXiv:1611.08459, 2016.
[Uys16] Alper Kursat Uysal. An improved global feature selection scheme for text
classification. Expert Systems with Applications, 43:82–92, Jan 2016.
[VKK
15] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton.
Grammar as a foreign language. Advances in Neural Information Pro-
cessing Systems, pages 2773–2781, 2015.
[VNO18] Khrulkov Valentin, Alexander Novikov, and Ivan Oseledets. Expres-
sive power of recurrent neural networks. In International Conference
on Learning Representations (ICLR), 2018.
134
[VSP
17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Atten-
tion is all you need. In Proceedings of Advances in Neural Information
Processing Systems (NIPS), pages 5998–6008. Curran Associates, 2017.
[VTBE15] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan.
Show and tell: A neural image caption generator. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages
3156–3164. IEEE, 2015.
[Wea16] Yonghui Wu and et. al. Google’s neural machine translation sys-
tem: Bridging the gap between human and machine translation.
arXiv:1609.08144, 2016.
[WHC
19] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan
Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced
cross-modal matching and self-supervised imitation learning for vision-
language navigation. In Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition (CVPR) 2019, Jun 2019.
[WKN
13] T. Wilson, Zornitsa Kozareva, Preslav Nakov, Sara Rosenthal, Veselin
Stoyanov, and Alan Ritter. International workshop on semantic evaluation
2013: Sentiment analysis in twitter. https://www.cs.york.ac.
uk/semeval-2013/task2.html, 2013.
[WLC
15] Tingting Wei, Yonghe Lu, Huiyou Chang, Qiang Zhou, and Xianyu Bao.
A semantic approach for text clustering using wordnet and lexical chains.
Expert Systems with Applications, 42(4):2264–2275, 2015.
[WPDN18] Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. Switchout:
an efficient data augmentation algorithm for neural machine translation.
In Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, pages 856–861, 2018.
[WSC
16] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad
Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, et al. Google’s neural machine translation system: Bridg-
ing the gap between human and machine translation. arXiv preprint
arXiv:1609.08144, 2016.
[WXWW18] Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang.
Look before you leap: Bridging model-free and model-based reinforce-
ment learning for planned-ahead vision-and-language navigation. In Pro-
ceedings of the European Conference on Computer Vision (ECCV) 2018,
pages 37–53, 2018.
135
[XBK
15] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron
Courville, Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show,
attend and tell: Neural image caption generation with visual attention.
In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 2048–2057. IEEE, Jun 2015.
[XZH
18] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan,
Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image
generation with attentional generative adversarial networks. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 1316–1324, 2018.
[YLL18] Chen Yun, Yang Liu, and Victor O.K. Li. Zero-resource neural machine
translation with multi-agent communication game. In Thirty-second
AAAI Conference on Artificial Intelligence 2018, Apr 2018.
[YP97] Yiming Yang and J.O. Pedersen. A comparative study on feature selection
in text categorization. Icml, 97:412–420, Jul 1997.
[YZL09] Qiang Ye, Ziqiong Zhang, and Rob Law. Sentiment classification of
online reviews to travel destinations by supervised machine learning
approaches. Expert Systems with Applications, 36(3):6527–6535, 2009.
[ZEL
18] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed
Elgammal. A generative adversarial approach for zero-shot learning from
noisy texts. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 1004–1013, 2018.
[ZJL15] Xiang Zhang, Zhao Junbo, and Yann LeCun. Character-level convolu-
tional networks for text classification. Advances in neural information
processing systems, pages 649–657, 2015.
[ZPIE17] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired
image-to-image translation using cycle-consistent adversarial networkss.
In Computer Vision (ICCV), 2017 IEEE International Conference on,
2017.
[ZZ05] X. Zhang and Junbo Zhao. Character-level convolutional networks for
text classification. http://goo.gl/JyCnZq, 2005.
136
Abstract
In this thesis, we improve the learning capability of sequence (language, speech, video, etc.) learning systems by enhancing their memory and enabling their visual imagination.

For the former, we focus on understanding the memory mechanism of sequence learning via recurrent neural networks (RNNs) and dimension reduction (DR). We put forward two viewpoints. The first viewpoint is that memory is a function that maps certain elements of the input sequence to the current output. This definition, for the first time in the literature, allows a detailed investigation of the memory of three basic RNN cell models, namely the simple RNN (SRN), the long short-term memory (LSTM), and the gated recurrent unit (GRU). We find that all three cell models suffer from memory decay. To overcome this limitation by design, we propose a new basic RNN model called the extended LSTM (ELSTM), which uses trainable scaling factors for better memory attention. The ELSTM achieves outstanding performance on a variety of complex language tasks. The second viewpoint is that memory is a compact representation of sparse sequential data. From this perspective, a dimension reduction method such as principal component analysis (PCA) becomes attractive. However, there are two known problems in applying PCA to sequence learning: the first is computational complexity
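As a toy illustration of the memory decay mentioned above (a minimal sketch, not the analysis developed in this thesis), one can run an Elman-style SRN on two input sequences that differ only in their first element and watch how quickly the corresponding hidden states converge. The weight scales, sequence length, and NumPy usage below are illustrative assumptions; with the recurrent weight scaled to an operator norm below one, the gap shrinks at least geometrically, i.e., the influence of an early input on the current state fades.

import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, T = 8, 16, 30
W_x = 0.1 * rng.standard_normal((d_h, d_in))
W_h = rng.standard_normal((d_h, d_h))
W_h *= 0.9 / np.linalg.norm(W_h, 2)               # scale recurrent weight to operator norm 0.9 (contractive case)

def run_srn(xs):
    # Elman SRN: h_t = tanh(W_x x_t + W_h h_{t-1}), starting from h_0 = 0.
    h, hs = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return hs

xs = [rng.standard_normal(d_in) for _ in range(T)]
xs_perturbed = [xs[0] + rng.standard_normal(d_in)] + xs[1:]   # change only the first input

h_a, h_b = run_srn(xs), run_srn(xs_perturbed)
for t in (0, 4, 9, 19, 29):
    print(f"t={t + 1:2d}  |h_t - h_t'| = {np.linalg.norm(h_a[t] - h_b[t]):.2e}")
# The printed gap decays with t: the first input's contribution to the state is gradually forgotten.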