Multimodal Reasoning of Visual Information and Natural Language
by
Kan Chen
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2019
Copyright 2019 Kan Chen
To my advisor Ram, my parents Ningrong and Jun
Acknowledgments
I would like to express my sincere thanks to my advisor Prof. Ramakant Nevatia
for his guidance and support. I have learned a lot from his deep understanding, broad
knowledge and insightful vision about Artificial Intelligence and Computer Vision. During
my study at the Computer Vision lab, I have been lucky to discuss research and life with
Prof. Nevatia several times per week. It has truly been an invaluable experience.
I would like to thank Prof. Keith Jenkins, Prof. Ulrich Neumann, Prof. C.-C. Jay Kuo
and Prof. Antonio Ortega for taking their precious time to serve on my qualification and
thesis defense committee; Prof. Shengjin Wang and Raquel Urtasun for introducing the
concept of Computer Vision to me back in the Tsinghua days; Song Cao for all the kind
help when I first arrived in the U.S.; Chen Sun for advice and help on research; Jiang Wang
and Wei Xu for collaboration at Baidu IDL; Chen Fang, Zhaowen Wang and Trung Bui
for collaboration at Adobe; Zeki Yalniz, Yixuan Li and Manohar Paluri for collaboration
at Facebook; Rama Kovvuri, Jiyang Gao and Zhenheng Yang for collaboration at USC.
I also want to thank the current and previous members of our Computer Vision lab, and
all my friends at USC.
I am grateful to my parents Ningrong and Jun for their understanding and support
over the past 27 years. My gratitude is beyond words. My life in the U.S. has been
made colorful by my friends Chuanxi Zhang, Linna Wang, Hang Fu, Peng Jiang, Weiyue
Wang, Xin Zhang, Xu Yuan, Yingjie He, Dongjie Zhang, Hao Gao, Arnav Agharwal,
Pramod Sharma, Runzhou Ge, Chuanzi He, Arka Sadhu and so many others.
Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Problem Statement
1.2 A Very Brief Review of History
1.3 Challenges
1.4 General Approach
1.4.1 AMC: Attention guided Multi-modal Correlation Learning for Image Search
1.4.2 MSRC: Multimodal Spatial Regression with Semantic Context for Phrase Grounding
1.4.3 Query-guided Regression Network with Context Policy for Phrase Grounding
1.4.4 Knowledge Aided Consistency for Weakly Supervised Phrase Grounding
1.4.5 ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
1.4.6 Visually Indicated Sound Generation by Perceptually Optimized Classification
1.5 Thesis Outline
2 Related Work
2.1 Image Search
2.2 Image Phrase Grounding
2.3 Visual Question Answering
3 Coarse Level Multimodal Reasoning for Image Search
3.1 Introduction
3.2 AMC Learning From Click-through Data
3.2.1 AMC learning framework
3.2.2 Multi-modal inter-attention network (MTN)
3.2.3 Visual intra-attention network (VAN)
3.2.4 Language intra-attention network (LAN)
3.2.5 Applications of AMC space
3.3 Dataset
3.4 Experiments
3.4.1 Multi-modal image retrieval
3.4.2 Caption ranking
4 Mid-Level Multimodal Reasoning for Phrase Grounding (Part I)
4.1 Introduction
4.2 MSRC System
4.2.1 Framework
4.2.2 Spatial Regression Network (SRN)
4.2.3 Context Refinement Network (CRN)
4.2.4 Training & Phrase grounding of MSRC
4.3 Experiments
4.3.1 Datasets
4.3.2 Experiment Setup
4.3.3 Performance on Flickr30K Entities
4.3.4 Performance on Refer-it Game
4.3.5 Qualitative results
5 Mid-Level Multimodal Reasoning for Phrase Grounding (Part II)
5.1 Introduction
5.2 QRC Network
5.2.1 Framework
5.2.2 Proposal Generation Network (PGN)
5.2.3 Query guided Regression Network (QRN)
5.2.4 Context Policy Network (CPN)
5.2.5 Training and Inference
5.3 Experiment
5.3.1 Datasets
5.3.2 Experiment Setup
5.3.3 Performance on Flickr30K Entities
5.3.4 Performance on Referit Game
5.3.5 Qualitative Results
6 Mid-Level Multimodal Reasoning for Phrase Grounding (Part III)
6.1 Introduction
6.2 KAC Network
6.2.1 Framework
6.2.2 Knowledge Based Pooling (KBP)
6.2.3 Visual Consistency
6.2.4 Language Consistency
6.2.5 Training & Inference
6.3 Experiment
6.3.1 Datasets
6.3.2 Experiment Setup
6.3.3 Performance on Flickr30K Entities
6.3.4 Performance on Referit Game
6.3.5 Discussion
6.3.6 Qualitative Results
7 Knowledge Level Multimodal Reasoning for Visual Question Answering
7.1 Introduction
7.2 Attention Based Configurable CNN
7.2.1 Attention Extraction
7.2.2 Question Understanding
7.2.3 Image Feature Extraction
7.2.4 Answer Generation
7.2.5 Training and Testing
7.3 Experiments
7.3.1 Implementation Details
7.3.2 Datasets
7.3.3 Evaluation Metrics
7.3.4 Baseline Methods
7.3.5 Results and Analysis
8 Reasoning between Visual and Audio Modalities
8.1 Introduction
8.2 Method
8.2.1 Framework
8.2.2 Classification based Audio generation Network
8.2.3 Perceptual optimization
8.2.4 Training & sound wave generation
8.3 Datasets
8.4 Experiments
8.4.1 Experiment setup
8.4.2 Performance on GHD
8.4.3 Performance on VIG
8.4.4 Qualitative evaluation
9 Conclusion and Future Work
9.1 Conclusion
9.2 Future Work
Bibliography
List of Tables

3.1 Different models evaluated on Clickture and ASD. The language intra-attention network (LAN) is applied on the keyword modality. The visual intra-attention network (VAN) is applied on the image modality. Late fusion (LF) and multi-modal inter-attention networks (MTN) are applied on multi-modalities.
3.2 Performance of different models on the Clickture dataset. The evaluation metrics are NDCG@5, 10, 15, 20, 25 (corresponding to the 2nd to 6th columns). For k ∈ {5, 10, 15, 20, 25}, we exclude queries with ranking list size less than k when we calculate NDCG@k.
3.3 Models' performance under different metrics
3.4 Performance of different models on ASD. The evaluation metrics are R@1, 5, 10, 15, 20 (corresponding to the 2nd to 6th columns).
3.5 Different models evaluated on CIC. Late fusion (LF) and inter-attention (MTN) networks are applied on multi-modalities. The caption modality is represented by Skip-thought vectors (Skip). The image modality is represented by either VGG features (VGG) or Resnet features (Res).
3.6 Performance of different models on CIC. The evaluation metrics are R@1, 5, 10 (corresponding to the 2nd to 4th columns). AMC models achieve competitive performance with only skip-thought vectors for the caption modality among all VQA-agnostic models.
4.1 Different models' performance on Flickr30K Entities. CRN is finetuned based on MNN with a Regression layer and takes VGG_det-SPAT1 as input visual features.
4.2 Phrase grounding performance for different phrase types defined in Flickr30K Entities. Accuracy is in percentage.
4.3 SRN MNN+Reg (VGG_det-SPAT1) model's performance (accuracy in %) under different dimensions of the multimodal subspace (weight of regression loss fixed to 1.0).
4.4 SRN MNN+Reg (VGG_det-SPAT1) model's performance (accuracy in %) under different weights of the regression loss in Eq. 4.1 (multimodal subspace dimension m = 128).
4.5 Different models' performance on Refer-it Game. Since there is no context information annotated, we only evaluate SRN models.
4.6 SRN MNN+Reg (VGG_cls-SPAT2) model's performance (accuracy in %) under different dimensions of the multimodal subspace on the Refer-it Game dataset. We fix the weight of the regression loss to 1.0.
4.7 SRN MNN+Reg (VGG_cls-SPAT2) model's performance (accuracy in %) under different coefficients of the regression loss in Eq. 4.1. We fix the multimodal subspace dimension m = 128.
5.1 Different models' performance on Flickr30K Entities. Our framework is evaluated in combination with various proposal generation systems.
5.2 Comparison of different proposal generation systems on Flickr30k Entities
5.3 QRC Net's performance on Flickr30K Entities for different weights of L_reg.
5.4 QRC Net's performance on Flickr30K Entities for different dimensions m of v_i^q.
5.5 QRC Net's performance on Flickr30K Entities for different reward values of CPN.
5.6 Different models' performance on the Referit Game dataset.
5.7 Phrase grounding performance for different phrase types defined in Flickr30K Entities. Accuracy is in percentage.
5.8 QRC Net's performance on Referit Game for different weights of L_reg.
5.9 QRC Net's performance on Referit Game for different dimensions m of v_i^q.
6.1 Different models' performance on Flickr30K Entities. We explicitly evaluate the performance of the visual consistency (VC) and language consistency (LC) branches with Hard and Soft KBP gates. We leverage knowledge from the MSCOCO [66] classification task.
6.2 Comparison of KAC Net using different KBP gates and external knowledge on Flickr30k Entities. Accuracy is in %.
6.3 Phrase grounding performance for different phrase types defined in Flickr30K Entities. Accuracy is in percentage.
6.4 Different models' performance on Referit Game. We leverage knowledge from the MSCOCO [66] classification task.
6.5 Comparison of KAC Net using different KBP gates and external knowledge on Referit Game. Accuracy is in %.
6.6 Different methods on Flickr30K Entities [93] (left) and Referit Game [52] (right) for two types of queries. Accuracy is in %.
7.1 Results on the Toronto COCO-QA dataset [96]
7.2 Toronto COCO-QA [96] accuracy per category
7.3 Results on the DAQUAR-reduced dataset [74]
7.4 Results on the DAQUAR-full dataset [74]
7.5 Performance of different models on the VQA dataset [2]
8.1 Number of training and testing samples in the Visually Indicated sound Generation (VIG) dataset
8.2 Different models' R@K performance on GHD (K = 1, 5, 10)
8.3 Classification accuracy of different models' generated sound by a pre-trained 5-layer neural network classifier.
8.4 Different models' R@K performance on VIG (K = 1, 5, 10)
List of Figures

1.1 Illustration of the challenges in multimodal reasoning. First is how to encode effective features in each modality. After extracting useful features, how to correlate them, find the most related sample and produce an appropriate answer is the challenge of reasoning. Finally, we need to handle large scale data and avoid expensive annotation collection procedures in building multimodal reasoning systems.
3.1 For different queries, it is helpful to select query-dependent information within and across rich image-related modalities available on the Internet. Bounding boxes and highlighted keywords correspond to different queries' intent by their colors.
3.2 Attention guided Multi-modal Correlation (AMC) learning framework. Left: Given a query, images and related keywords are projected to a raw embedding space. The AMC model then generates a query-guided multi-modal representation for each image. The correlation between query and image is measured by the cosine distance in the AMC space. Right: The AMC model consists of a visual intra-attention network (VAN), a language intra-attention network (LAN) and a multi-modal inter-attention network (MTN). VAN and LAN attend on informative parts within each modality and MTN balances the importance of different modalities according to the query's intent.
3.3 Images with keywords in Clickture [121] (left) and the COCO image caption dataset [66] (right). Since each image is associated with ~100 keywords, not all keywords are listed.
3.4 ROC curve for different models.
3.5 Visualization of the AMC model's VAN, LAN and MTN results. First column: input query and importance of visual and language modalities produced by MTN. Second and third columns: original images and query-guided attention maps produced by VAN. Fourth column: some keywords highlighted by LAN.
4.1 The Multimodal Spatial Regression with semantic Context (MSRC) system regresses each proposal based on the query's semantics and visual features. Besides, MSRC takes advantage of context cues to filter out confusing candidates and refine regression results. (Each regression box's ID corresponds to a proposal box's ID, with confidence on the top-left corner.)
4.2 Structure of the MSRC system. (a) An example image and query phrases: for the query "A woman" (blue text), queries in red text are considered as its context, which are further utilized by CRN. The input image is represented as a set of proposal bounding boxes (green), and the ground truth for the query is the red box. (b) Structure of SRN: SRN takes proposals and a query phrase as inputs. Multimodal features are encoded by a Multimodal Neural Network (MNN). SRN predicts each proposal's probability of being related to the query as well as regression parameters to localize the mentioned object. (c) Framework of MSRC: an SRN is first trained and later utilized to finetune CRN. CRN refines the probability predicted by SRN via encoding context information. (d) Structure of CRN: each (language, proposal set) pair has an SRN to predict confidence. All SRNs share weights during training. We propose a joint prediction loss to encode context information.
4.3 Some phrase grounding results generated by the MSRC system on the Flickr30K and Refer-it Game datasets. We visualize the ground truth bounding box, selected proposal box and regressed bounding box in blue, green and red respectively. The first three rows are phrase grounding results on the Flickr30K Entities dataset. The first column is the input image and query phrases coming from the same image caption. The 2nd-4th columns correspond to different queries and grounding results. The fourth row contains grounding results on the Refer-it Game dataset. For different queries, the MSRC system is able to localize objects in the same images. However, when a query is not clear without further context information, the MSRC system may ground wrong objects (image in row four, column four).
5.1 QRC Net first regresses each proposal based on the query's semantics and visual features, and then utilizes context information as rewards to refine grounding results.
5.2 The Query-guided Regression network with Context policy (QRC Net) consists of a Proposal Generation Network (PGN), a Query-guided Regression Network (QRN) and a Context Policy Network (CPN). PGN generates proposals and extracts their CNN features via a RoI pooling operation [97]. QRN encodes the input query's semantics by an LSTM [44] model and regresses proposals conditioned on the query. CPN samples the top ranked proposals, and assigns rewards considering whether they are foreground (FG), background (BG) or context. These rewards are back propagated as policy gradients to guide QRC Net to select more discriminative proposals.
5.3 Some phrase grounding results on Flickr30K Entities [93] (first two rows) and Referit Game [52] (third row). We visualize the ground truth bounding box, selected proposal box and regressed bounding box in blue, green and red respectively. When a query is not clear without further context information, QRC Net may ground wrong objects (e.g., image in row three, column four).
6.1 (a) Supervised grounding systems, (b) state-of-the-art weakly supervised grounding systems guided by language consistency, (c) KAC Net applies both visual and language consistency and leverages complementary knowledge from the visual feature extractor to facilitate weakly supervised grounding.
6.2 The Knowledge Aided Consistency Network (KAC Net) consists of a visual consistency branch and a language consistency branch. The visual consistency branch aims at predicting and aligning query-related proposals' location parameters conditioned on the input query. The language consistency branch attempts to reconstruct the input query from query-related proposals. To provide guidance in training and testing, a Knowledge Based Pooling (KBP) gate is applied to filter out unrelated proposals for both branches.
6.3 A pre-trained CNN always predicts a probability distribution for its own task. We leverage the most probable category predicted by the CNN and calculate the word similarity with noun words in the query as knowledge k_i^q.
6.4 Some phrase grounding results on Flickr30K Entities [93] (first three rows) and Referit Game [52] (fourth row). We visualize the ground truth bounding box and the grounding result in green and red respectively. When a query is not clear without further context information, KAC Net may ground reasonably incorrect objects (e.g., image in row three, column two).
7.1 Attention in visual question answering. For different questions, the corresponding attention region varies from the white dashed box "coat" in the left image to the one for "umbrella" in the right image.
7.2 The framework of ABC-CNN. The green box denotes the image feature extraction part using a CNN; the blue box is the question understanding part using an LSTM; the yellow box illustrates the attention extraction part with configurable convolution; the red box is the answer generation part using multi-class classification based on attention weighted image feature maps. The orange letters are the corresponding variables explained in Eq. (7.1)-(7.6).
7.3 Example images and image-related QA pairs in the Toronto COCO-QA dataset [96], DAQUAR dataset [74] and VQA dataset [2]. For the VQA dataset, every question has 10 candidate answers. We show the answer with the most votes for each question.
7.4 Selected images with image-related questions and question-guided attention maps generated by ABC-CNN on the Toronto COCO-QA dataset [96]. We find the proposed ABC-CNN model is capable of focusing its attention on different regions for different questions. The attention maps help filter out irrelevant information and model the spatial relationship of the objects in images.
8.1 Difference between the sound of hitting an "iron cabinet" and "water" in sound wave and spectrogram. It is hard for a generic model to handle all kinds of sound generation.
8.2 Framework of the Perceptually Optimized Classification based Audio generation Network (POCAN). Video frames are first processed by a CNN and then fed into an LSTM. To generate sound clips, POCAN predicts sound classes, regresses the LSTM's hidden states into spectrograms and then transforms the predicted spectrograms into sound waveforms. To increase the quality of the generated sound, a pre-trained SoundNet [3] is applied to calculate a perceptual loss during the training stage.
8.3 Some examples of video clips and corresponding sound clips in the VIG dataset.
8.4 Comparison of confusion matrices of sound classification results by a pre-trained 5-layer neural network classifier. Each row is the confusion made for a single sound class. The left and right figures are the confusion matrices of sound generated by [88] and POCAN respectively.
8.5 Spectrograms of ground truth sound (GT) and exemplar sound retrieved by POCAN on the GHD dataset. For each sample, we label some moments when actions happen in the GT and exemplar sound.
8.6 Spectrograms of ground truth sound (GT) and exemplar sound retrieved by POCAN on the VIG dataset. For each sample, we label some moments when actions happen in the GT and exemplar sound.
Abstract

Multimodal reasoning focuses on learning the correlation between different modalities presented in multimedia samples. It is an important task which has many applications in our daily lives, e.g., autonomous driving, robotics question answering, and image retrieval engines. It is also a challenging task which is closely related to Machine Learning, Natural Language Processing and other research areas in Computer Science. Typically, multimodal reasoning can be divided into three levels: i) the coarse level treats each modality as a uniform sample and focuses on learning inter-modal correlation; ii) the fine-grained level considers each modality's own characteristics and dives into fine-grained correlation learning; iii) the knowledge level leverages external knowledge to deal with more complex question-answering type reasoning.

This thesis describes my solutions to the three levels of multimodal reasoning. Most parts focus on the interaction between the natural language modality and the visual modality. The first part addresses the image retrieval problem, which lies at the coarse level. We introduce an attention mechanism to attend on useful information within each modality and weight each modality's importance, which boosts image retrieval performance.

In the second part, we address the phrase grounding problem, which is at the fine-grained level. We introduce a regression mechanism, reinforcement learning techniques and multimodal consistency step by step, transferring from the supervised learning scenario to the weakly supervised scenario. All of the above techniques bring concrete improvements in phrase grounding performance.

In the third part, we explore the Visual Question Answering (VQA) problem at the knowledge level. Similar to human behavior in VQA, we introduce an attention mechanism to attend on useful regions conditioned on the input query's semantics, which filters out noise in the visual modality for answering questions.

Finally, we illustrate our recent efforts in reasoning over other modalities. We address the problem of generating sound waves from video content, where we consider fine-grained information and adopt a perceptual loss to further polish the generated sound waves. We also discuss some interesting problems which we plan to address in the future.

This thesis demonstrates the effectiveness of all the algorithms on a range of publicly available video and image datasets.
Chapter 1
Introduction
Information exists in dierent modalities: visual information from images and videos,
textual information from natural language descriptions, acoustic information from sound,
etc. Multimodal information refers to the information simultaneously existing in multiple
modalities. How to reason and correlate multimodal information becomes an important
problem, which has many applications in our daily lives such as image search engine,
suspicious tracking and car auto-driving.
Like most of the other high-level vision tasks, there is a natural hierarchy for multi-
modal reasoning: treat each modality as a unit, and correlate them based on their global
characteristics; dive into detailed structures for each modality, and interact between dif-
ferent parts and nally, apply existing knowledge to reason and correlate dierent modal-
ities. In this hierarchy, understanding global characteristics will provide useful guidance
to reason in ne-grained parts in each modality, while applying learned knowledge can
further dig into the essential connection between dierent modalities' information and
build a robust Query-Answer (QA) system for dierent applications.
1.1 Problem Statement
To clarify the natural hierarchy of multimodal reasoning, I divide this problem into three
levels:
• Coarse level: Learning the correlation between modalities without considering each
modality's local structure.
• Fine-grained level: Considering detailed structure in each modality and building fine-grained correlation between them.
• Knowledge level: Incorporating external or internal learned knowledge to build a more robust system, which is open to free-style queries.
Correspondingly, my personal interests lie in problems which are being actively investigated by researchers: image search, phrase grounding, visually indicated sound generation and visual question answering. These problems are defined as follows:
• Image search (coarse level): given a natural language query, retrieve the most related images in an image database.
• Image phrase grounding (fine-grained level): given a natural language query and an image, localize the mentioned objects in the image.
• Visually indicated sound generation (fine-grained level): given video frames, generate reasonable sound waves based on the video content.
• Visual question answering (knowledge level): given an image and an image-related
natural language query, produce a natural language answer for this query.
1.2 A Very Brief Review of History
Multimodal reasoning has recently emerged as a hot research topic in the computer vision community, possibly due to the wide availability of data (although most of it is not annotated) and the interest from both academia and industry.
In the early days, however, researchers spent most of their effort on modeling the correlation between input queries and images in a restricted image search scenario (e.g., a fixed number of queries, which reduces to the image classification problem [59], or a small number of images in the search space), and on using Canonical Correlation Analysis (CCA) [39] and its variants to build simple image retrieval systems.
With the development of deep learning techniques, researchers became more interested in the detailed structure of images, and explored fine-grained level problems, such as object detection [19] (fixed number of queries) and phrase grounding [92] (freestyle queries). With deep neural networks [105] to extract strong features in visual and language modalities, people were able to achieve reasonable performance on these tasks.

Figure 1.1: Illustration of the challenges in multimodal reasoning. First is how to encode effective features in each modality. After extracting useful features, how to correlate them, find the most related sample and produce an appropriate answer is the challenge of reasoning. Finally, we need to handle large scale data and avoid expensive annotation collection procedures in building multimodal reasoning systems.
Recently, researchers have focused on the more challenging visual question answering problem, which is open to freestyle queries and answers them with natural language output. Compared to the previous problems, it is more challenging as it may need to seek the help of external knowledge to give an answer. Initial efforts involve projecting visual and language features into a subspace and using a decoder to produce the answer [111].
1.3 Challenges
To understand and reason about multimodal information, one has to tackle the following challenges, which are illustrated in Fig. 1.1:
Feature encoding. Not all information is useful in each modality. How to extract
useful information in each modality and encode it into compact features is an essential
step for the multimodal reasoning.
Modality reasoning. After parsing the query's semantics and extracting visual features, how to find the correlation between the query and the input data directly influences the quality of the produced answer. Besides, after finding the most related visual features, how to decode these features and produce the answer in an appropriate format is also an important problem.
Scale and cost of data. Currently we only focus on academic datasets, which contain fewer than 10 million images and queries. However, on average 350 million photos are uploaded daily to Facebook, ignoring other social networks or search engines. How to generalize the trained model to handle such a huge scale of data is a big challenge. Moreover, academic datasets have human labeling for training. These annotations are expensive to obtain and suffer from potential human bias (e.g., decisions about the relevance of images to queries). How to train the model in weakly supervised or unsupervised manners is another challenge.
1.4 General Approach
To address these challenges, my basic philosophy is to adaptively extract features and infer possible correct answers based on different queries. I mainly adopt three useful tools in my research:
• An attention mechanism to attend on useful parts in each modality conditioned on input queries [11, 6].
• A regression mechanism to infer the possible correct object position based on the query's semantics [8, 10].
• Leveraging context information [8, 10].
These three tools are tightly connected with each other. The attention mechanism aims at attending on more related objects based on the input query, which tries to align the candidates' distribution with the input query. After providing more related candidates, the regression mechanism tries to predict the possible ground truth location to increase the chance of finding the answer, while context information helps rule out unrelated candidates to reduce the chance of an incorrect choice in the reasoning step. After combining these three mechanisms, I manage to learn effective representations for both visual and textual modalities, which naturally produce more robust and accurate answers for different tasks. I will introduce some of my work towards this goal in the following paragraphs.
1.4.1 AMC: Attention guided Multi-modal Correlation Learning for
Image Search
This work focuses on image search, which is at the coarse level. We leverage visual and textual
modalities for image search by learning their correlation with input query. We propose a
novel Attention guided Multi-modal Correlation (AMC) learning method which consists
of a jointly learned hierarchy of intra and inter-attention networks. Conditioned on
query's intent, intra-attention networks (i.e., visual intra-attention network and language
intra-attention network) attend on informative parts within each modality; a multi-modal
inter-attention network promotes the importance of the most query-relevant modalities.
In experiments, we evaluate AMC models on the search logs from two real world image
search engines and show a significant boost on the ranking of user-clicked images in search
results.
1.4.2 MSRC: Multimodal Spatial Regression with Semantic Context
for Phrase Grounding
This work aims at solving phrase grounding, which is at the fine-grained level. Given an image and a natural language phrase as a query, a grounding system localizes the mentioned objects in the image according to the query's specifications. We propose a novel Multimodal Spatial Regression with semantic Context (MSRC) system which not only predicts the location of the ground truth (i.e., regression) based on proposal bounding boxes, but also refines prediction results by maximizing the margin between different queries coming from the same sentence. The advantages of MSRC are twofold: first, it relieves the performance limitation imposed by proposal generation algorithms by using a spatial regression network. Second, MSRC not only encodes the semantics of a query phrase, but also deals with its relation to other queries in the same sentence (i.e., context) by adopting a context refinement network, which helps filter out confusing candidates during grounding.
1.4.3 Query-guided Regression Network with Context Policy for Phrase
Grounding
Compared with the MSRC system, we adopt a spatial regression method to break the performance limit, and introduce reinforcement learning techniques to further leverage semantic context information. We propose a novel Query-guided Regression network with Context policy (QRC Net) which jointly learns a Proposal Generation Network (PGN), a Query-guided Regression Network (QRN) and a Context Policy Network (CPN). Experiments show QRC Net provides a significant improvement in accuracy on two popular datasets, Flickr30K Entities and Referit Game, with 14.25% and 17.14% increases over the state of the art respectively.
1.4.4 Knowledge Aided Consistency for Weakly Supervised Phrase
Grounding
Following the work on the QRC and MSRC systems, we further explore phrase grounding in the weakly supervised scenario (i.e., the mapping between image proposals and language
is not available in the training set). Compared to previous methods, we explore the con-
sistency contained in both visual and language modalities, and leverage complementary
external knowledge to facilitate weakly supervised grounding. We propose a novel Knowl-
edge Aided Consistency Network (KAC Net) which is optimized by reconstructing the input query and proposal information. To leverage complementary knowledge contained in the
visual features, we introduce a Knowledge Based Pooling (KBP) gate to focus on query-
related proposals. Experiments show that KAC Net provides a significant improvement
on two popular datasets.
1.4.5 ABC-CNN: An Attention Based Convolutional Neural Network
for Visual Question Answering
In this work, we address the visual question answering task (VQA), which is at the knowledge level. Given an image and an image-related question, VQA returns a natural language answer. Since different questions inquire about the attributes of different image regions, generating correct answers requires the model to have question-guided attention, i.e., attention on the regions corresponding to the input question's intent. We introduce an attention-based configurable convolutional neural network (ABC-CNN) to locate the question-guided attention based on input queries. ABC-CNN determines the attention regions by finding the corresponding visual features in the visual feature maps with a "configurable convolution" operation. With the help of the question-guided attention, ABC-CNN can achieve both higher VQA accuracy and a better understanding of the visual question answering process. We evaluate the ABC-CNN architecture on three benchmark VQA datasets: Toronto COCO-QA, DAQUAR, and the VQA dataset. The ABC-CNN model achieves significant improvements over state-of-the-art methods. The question-guided attention generated by ABC-CNN is also shown to correspond to the regions that are highly relevant to the questions' intents.
1.4.6 Visually Indicated Sound Generation by Perceptually Optimized
Classification
Visually indicated sound generation aims to predict visually consistent sound from the
video content. We explore generating fine-grained sound from a variety of sound classes, and leverage pre-trained sound classification networks to improve the audio generation
quality. We propose a novel Perceptually Optimized Classication based Audio generation
Network (POCAN), which generates sound conditioned on the sound class predicted from
visual information. Additionally, a perceptual loss is calculated via a pre-trained sound
classification network to align the semantic information between the generated sound and its ground truth during training. Experiments show that POCAN achieves significantly better results on the visually indicated sound generation task on two datasets.
1.5 Thesis Outline
The thesis is outlined as follows. We begin in Chapter 2 with a brief overview of the literature on image search, phrase grounding and visual question answering. Chapter 3 presents our framework for image search with hierarchical attention networks. In Chapter 4, we describe our method for image phrase grounding with a regression mechanism. Chapter 5 presents our framework for phrase grounding with reinforcement learning techniques. Based on the progress in supervised learning, we further explore phrase grounding in the weakly supervised learning scenario in Chapter 6. The idea of visual question answering is described in Chapter 7. In Chapter 8, we illustrate how we reason between the audio and visual modalities. Finally, Chapter 9 concludes the thesis and discusses future directions.
Chapter 2
Related Work
In this chapter, we review the related works in image search, image phrase grounding and
visual question answering.
2.1 Image Search
To address the image search problem, multimodal correlation learning is a strong tool.
Canonical correlation analysis (CCA) [39] learns a cross-modal embedding space to maxi-
mize the correlation between different modalities. Kernel CCA (KCCA) [22] extends CCA by adopting a non-linear mapping for different modalities. Alternatively, Nakayama et
al. propose kernel principal component analysis with CCA (KPCA-CCA) [85], which
generates input for CCA via non-linear KPCA method. Gong et al. [35] further include
a third view into the CCA space by the semantics between image and tags. Similarly,
partial least squares (PLS) [100] aims to measure the correlation by projecting multiple
sets of data into a latent space. Ngiam et al. [86] introduce deep multimodal learning
using neural networks. Recently, Datta et al. [16] learn the correlation between query
and multiple image-related modalities using a graph-based keyphrase extraction model.
In contrast, we incorporate the attention mechanism to boost performance. Attention mechanisms have been successfully applied in many computer vision tasks, including object detection [80] and fine-grained image classification [67]. Jin et al. [48] develop an attention-based
model for image captioning task that employs an RNN to attend on informative regions
in images. Yang et al. [120] and Chen et al. [11] apply attention networks that focus
on useful regions in visual question answering (VQA) task. Xiong et al. [118] propose a
dynamic memory network to attend on informative visual or textual modality for question
answering. Recently, Lu et al. [71] propose a co-attention network to focus on both visual
and question modalities in the VQA task. Compared to these methods, the AMC method not only applies intra-attention networks within each modality, but also employs MTN to balance the importance of modalities based on the query's intent for the image search task.
For the image search task, CCA [39] is employed to learn a subspace to maximize the correlation between query and image. Ranking CCA (RCCA) [121] refines the CCA space
by learning a bilinear ranking function from click-through data. Wang et al. [112] apply a deep ranking model for fine-grained image search and Tan et al. [127] introduce a deep ranking based hashing model. Recently, Gordor et al. [37] apply a region proposal network and Radenović et al. [94] adopt deep CNN features. Lynch et al. [72] transfer
deep semantic features learned from click-through data and apply them on image search
task. Compared to the approaches above, the AMC method applies VAN to adaptively select informative regions within the image modality based on the query's intent. On the other hand, for the textual search task, Joachims [49] introduces click-through data for optimizing search
engines. DSSM [47] applies a deep framework to further leverage click-through data.
Compared to DSSM [47], AMC method employs LAN to attend on query-related words.
2.2 Image Phrase Grounding
Image phrase grounding requires learning correlation between visual and language modal-
ities. Karpathy et al. propose to align sentence fragments and image regions in a sub-
space, with a dependency tree [53] or a bi-directional RNN in [1]. Hu et al. [45] propose
a SCRC model which adopts a 2-layer LSTM to rank proposals using encoded query and
visual features. Rohrbach et al. [99] employ a latent attention network conditioned on
the query which ranks proposals in an unsupervised scenario. Other approaches learn the correlation between visual and language modalities based on Canonical Correlation Analysis (CCA) [39] methods. Plummer et al. [92] first propose a CCA model to learn the multimodal correlation. Wang et al. [113] employ structured matching and use phrase pairs
to boost performance. Recently, Plummer et al. [93] augment the CCA model to lever-
age extensive linguistic cues in the phrases. All of the above approaches are reliant on
external object proposal systems and hence, are bounded by their performance limits.
To generate candidates for phrase grounding, it is necessary to utilize a proposal gener-
ation system. Proposal generation systems are widely used in object detection and phrase
grounding tasks. Two popular methods, Selective Search [109] and EdgeBoxes [129], employ efficient low-level features to produce proposals for possible object locations. Based on proposals, the spatial regression method has been successfully applied in object detection. Fast R-CNN [33] first employs a regression network to regress proposals generated by Selective
Search [109]. Based on this, Ren et al. [97] incorporate the proposal generation system
by introducing a Region Proposal Network (RPN) which improves both accuracy and
speed in object detection. Redmon et al. [95] employ a regression method at the grid level and use non-maximal suppression to improve the detection speed. Liu et al. [69] integrate proposal generation into a single network and use outputs discretized over different ratios
and scales of feature maps to further increase the performance.
To leverage context information, we apply reinforcement learning techniques. Rein-
forcement learning is first introduced to deep neural networks in Deep Q-learning (DQN) [81], which teaches an agent to play ATARI games. Lillicrap et al. [64] modify DQN by introducing deep deterministic policy gradients, which enables the reinforcement learning framework to be optimized in continuous spaces. Recently, Yu et al. [123] adopt a reinforcer
to guide speaker-listener network to sample more discriminative expressions in referring
tasks. Liang et al. [63] introduce reinforcement learning to traverse a directed semantic
action graph to learn visual relationship and attributes of objects in images.
2.3 Visual Question Answering
VQA and image captioning are highly related because both of them need to reason
about the visual contents and present the results in a full natural language sentence
or in a word. Current state-of-the-art methods in VQA [26][75][96] and image captioning
[77][18][53][110][119][130] generally apply a CNN to extract visual features and an LSTM
model as a decoder to generate answers or captions. [26][77][75] apply a multi-modal layer
to combine the visual features and word embedding vectors by a joint projection during
the caption generation in the LSTM decoder. [96] employs the projected image features
as the starting states of the LSTM decoder, similar to the encoder-decoder framework
in sequence to sequence learning [106]. Treating image features as global visual features,
these studies in VQA and image captioning fail to exploit the valuable information in
questions to focus their attention on the corresponding regions in images.
To further boost performance, we borrow the idea of a configurable convolutional neural network. In [58], a dynamic convolutional layer architecture is proposed for short-range
weather prediction. The convolutional kernels in the dynamic convolutional layer are
determined by a neural network encoding the information of weather images in previous
time steps, which can be applied in the scenario of VQA.
Chapter 3
Coarse Level Multimodal Reasoning for Image Search
3.1 Introduction
Image search by text is widely used in everyday life (e.g., search engines, security surveil-
lance, mobile phones). Given a textual query, image search systems retrieve a set of
related images by the rank of their relevance. Learning this relevance, i.e., correlation
between query and image, is key to the system's utility.
Figure 3.1: For different queries, it is helpful to select query-dependent information within and across rich image-related modalities available on the Internet. Bounding boxes and highlighted keywords correspond to different queries' intent by their colors.
To measure the correlation between query and image, typically a shared latent sub-
space is learned for query's text modality and a single image-related modality (e.g., visual
contents, surrounding text). Traditional image search engines [15, 98] match queries with
text or tags associated with images. DSSM [47] learns an embedding subspace to measure
the correlation between document-related text modality and query's text modality using
deep learning. On the other hand, cross-modal methods [121, 37, 94, 25] learn a subspace
to better measure correlation between query's text modality and image's visual modality.
In recent years, multiple image-related modalities are becoming widely available online
(e.g., images on social networks are typically posted with captions and tags, followed
by friends' comments). Text matching and cross-modal methods are suboptimal due to
their focus on only a single image-related modality. As shown in Fig. 3.1, image content can provide detailed visual information (e.g., color, texture) of objects while keywords can offer abstract concepts (e.g., scene description) or external background information (e.g., people's identities). Different modalities describe images from different views, which together provide information in a more comprehensive way. It is beneficial to learn a subspace
to measure the correlation between query's text modality and image-related modalities,
i.e., multi-modal correlation.
There is a major challenge in learning this subspace: not all modalities are equally
informative due to the variation in query's intent. To overcome this problem, we intro-
duce an attention mechanism to adaptively evaluate the relevance between a modality and
query's intent. For the image search task, we consider two kinds of attention mechanisms.
First, there is query-unrelated information within each modality (e.g., background regions
in images, the keyword "Ice-cream" for query 2 "Christmas" in Fig. 3.1); an image search system should attend on the most informative parts for each modality (i.e., intra-attention). Second, different modalities' contributions vary for different queries; an image search system should carefully balance the importance of each modality according to the query's intent
(i.e., inter-attention).
To address the aforementioned issues, we propose a novel Attention guided Multi-
modal Correlation (AMC) learning method. AMC framework contains three parts: visual
intra-attention network (VAN), language intra-attention network (LAN) and multi-modal
inter-attention network (MTN). VAN focuses on informative image regions according to
query's intent by generating a query-guided attention map. LAN learns to attend on
related words by learning a bilinear similarity between each word in language modality
and query. MTN is built to attend between different modalities. Finally, the correlation between query and image-related modalities is calculated as the distance between query
embedding vector and a multi-modal embedding vector in the learned AMC space.
To validate the AMC framework, we choose image-related keywords as the language
modality and image contents as the visual modality. AMC models are evaluated on two
datasets: Clickture dataset [121] and Adobe Stock dataset (ASD). ASD is collected from
Adobe Stock search engine, including queries, images, manually curated keywords and
user clickthrough data. For Clickture, we curated keywords for all images by an auto-
tagging program developed internally at Adobe. Experiments show that AMC achieves significant improvement on both datasets. More importantly, this finding indicates that AMC can benefit from not only human curated data, but also information generated by
machines, which could be noisy and biased. Moreover, since AMC can scale to any number
of modalities, it has the ability to integrate and benefit from the output of any intelligent visual analysis system. We further evaluate AMC on the caption ranking task on COCO image caption data [1] with a keyword set curated in the same way as for Clickture [121]. AMC models achieve very competitive performance, even surpassing the state-of-the-art method in the Recall@10 metric.
Our contributions are as follows: we propose a novel AMC learning framework to select query-dependent information within and across different modalities. The AMC model achieves significant improvement in the image search task. We plan to release the auto-tagged Clickture and COCO datasets upon publication.
3.2 AMC Learning From Click-through Data
The goal of Attention guided Multi-modal Correlation learning (AMC) method is to
construct an AMC space where the correlation between query q and image x can be
measured by the distance between query's embedding vector q
m
and image's query-guided
multi-modal representation x
q
(superscript \m" denotes the multi-modal subspace in q
m
).
To learn the AMC space, we propose a hierarchy of intra and inter attention networks, i.e.,
visual intra-attention network (VAN), language intra-attention network (LAN) and multi-
modal inter-attention network (MTN). In this paper, we select image-related keywords as
15
Query: birthday party
cake, balloon, candle,
happy children …
Language modality
Visual modality
Multi-modal
Inter-attention
network
Language
Intra-attention
network
cake, balloon, candle,
happy children …
Visual
Intra-attention
network
Query
embedding
vector
Image:
0.65
Keyword:
0.35
(b) AMC Model details
v
K
q
v
q
k
q
s
q
M
q
m
p(K|q)
q
m
,x
q
qW
ql
Keyword:
birthday, …
Query:
birthday
party Keyword: cat, … Keyword: dog, …
…
q v
+
, K
+
,
"
#
"
#
,
%
#
%
#
…
AMC
model
AMC
model
AMC
model
…
Weight
shared
Weight
shared
q
m
, x
q+
q
m
,
"
'#
q
m
,
%
'#
Ranking loss
(a) AMC framework
…
AMC space
q’
Figure 3.2: Attention guided Multi-modal Correlation (AMC) learning framework. Left:
Given a query, images and related keywords are projected to a raw embedding space.
AMC model then generates a query-guided multi-modal representation for each image.
The correlation between query and image is measured by the cosine distance in the
AMC space. Right: AMC model consists of a visual intra-attention network (VAN),
a language intra-attention network (LAN) and a multi-modal inter-attention network
(MTN). VAN and LAN attend on informative parts within each modality and MTN
balances the importance of different modalities according to the query's intent.
We first present the AMC learning framework, followed by the details of the inter-attention
network (MTN). The intra-attention networks (VAN and LAN) are then introduced. Finally,
we illustrate how to apply the learned AMC space to the image search and caption ranking
tasks.
3.2.1 AMC learning framework
In the AMC space, the correlation between a query q and an image x is measured by the
cosine distance ⟨q^m, x^q⟩, where q^m ∈ R^d is the embedding vector of q and x^q ∈ R^d is
the multi-modal representation of x conditioned on the query's intent. To learn the AMC
space, we sample N tuples of the form [q, (x^+, K^+), (x_1^-, K_1^-), (x_2^-, K_2^-), ..., (x_t^-, K_t^-)]
from click-through data. Each tuple consists of a query q, a positive image x^+ with its
keyword set K^+, and t negative images x_i^- with their keyword sets K_i^-. Given the query
q in a tuple, the positive image x^+ has the highest number of clicks. Similar to [121], we
adopt a common ranking loss function as the objective:

    argmin_Θ  Σ_{i=1}^{N} L_Θ(q_i, {x_i^+, K_i^+}, {x_ij^-, K_ij^-}_{j=1}^{t})
    L_Θ = Σ_{j=1}^{t} max(0, α − ⟨q_i^m, x_i^{q+}⟩ + ⟨q_i^m, x_ij^{q−}⟩)        (3.1)

where Θ denotes the model's parameters to be optimized and α is the margin between
positive and negative samples.
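To make the objective concrete, the minimal Python sketch below evaluates the hinge ranking loss of Eq 3.1 for a single training tuple. The embeddings q_m, x_pos and x_negs are assumed to be precomputed stand-ins (hypothetical placeholders, not the actual AMC implementation), and cosine similarity is used for ⟨.,.⟩:

import numpy as np

def cosine(a, b):
    # Cosine similarity <a, b> used as the correlation measure.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ranking_loss(q_m, x_pos, x_negs, alpha=1.0):
    # Hinge ranking loss of Eq 3.1 for one tuple.
    # q_m    : query embedding in the AMC space, shape (d,)
    # x_pos  : multi-modal embedding of the clicked (positive) image, shape (d,)
    # x_negs : list of embeddings of the t negative images
    # alpha  : margin between positive and negative samples
    pos_sim = cosine(q_m, x_pos)
    return sum(max(0.0, alpha - pos_sim + cosine(q_m, x_neg)) for x_neg in x_negs)

# Toy usage with random 80-d embeddings (d = 80 as in the later experiments).
rng = np.random.default_rng(0)
q_m, x_pos = rng.normal(size=80), rng.normal(size=80)
x_negs = [rng.normal(size=80) for _ in range(3)]
print(ranking_loss(q_m, x_pos, x_negs, alpha=1.0))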
To learn the query's embedding q^m and the query-guided multi-modal representation
x^q for image x, we propose a multi-modal inter-attention network (MTN) to attend on
informative modalities. The inputs of MTN are query-guided single-modality embeddings
produced by intra-attention networks. Specifically, the intra-attention networks consist of a
visual intra-attention network (VAN) and a language intra-attention network (LAN). For the
visual modality, VAN focuses on useful regions in the image content and generates a
query-guided visual embedding v^q ∈ R^d; for the language modality, LAN filters out unrelated
words and generates a query-guided language embedding k^q ∈ R^d. The AMC framework
is trained in an end-to-end way by integrating VAN, LAN and MTN (Fig 3.2).
For simplicity, we denote the input feature for query q as q ∈ R^{d_q}. Each image x is
represented as an r×r feature map v ∈ R^{r×r×d_v}. The input feature matrix for keyword set
K is denoted as K = {k_1, k_2, ..., k_n}^T ∈ R^{n×d_k}, where n is the keyword set size and k_j
is the j-th keyword's feature vector of image x. d_q, d_k and d_v are the feature dimensions
for query, keyword and image respectively.
3.2.2 Multi-modal inter-attention network (MTN)
MTN generates the embedding q^m of the query by projecting the query's input feature q into
the AMC space through a non-linear transform:

    q^m = f(W_qm q + b_qm)        (3.2)

where W_qm ∈ R^{d_q×d} and b_qm ∈ R^d are the linear transformation matrix and bias vector to
be optimized, and f(.) is a non-linear activation function. In addition, MTN encodes the query's
intent q' using another transform of the same form as Eq 3.2. Conditioned on the query's intent,
the correlation of the embeddings [v^q, k^q] produced by VAN and LAN is calculated as:

    [c_v, c_k] = ⟨q', [v^q, k^q]⟩,   q' = f(W'_qm q + b'_qm)        (3.3)

[c_v, c_k] denotes the correlation of the visual and language modalities, ⟨.,.⟩ is the cosine
distance measurement, f(.) is a non-linear activation function, and W'_qm, b'_qm are variables to
be optimized. MTN then re-weights the visual and language modalities based on their
probabilities conditioned on the input query's intent (e.g., in Fig 3.2, the relevance scores
for the visual modality ("Image") and the language modality ("Keyword") are 0.65 and 0.35,
indicating the visual modality is more relevant than the language modality for the query
"birthday party"). The conditional probability of each modality is measured based on the
correlation in Eq 3.3. The final multi-modal embedding x^q ∈ R^d in the AMC space is:

    x^q = p_v v^q + p_k k^q,   [p_v, p_k] = σ([c_v, c_k])        (3.4)

where σ(.) is a softmax function. x^q encodes the useful information from different modalities
conditioned on the input query's intent.
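As a rough illustration of Eqs 3.2-3.4, the Python sketch below computes the query embedding, the modality correlations and the softmax-weighted multi-modal embedding. The weight matrices are randomly initialized placeholders for the learned parameters, so this mirrors only the forward pass, not the trained AMC model:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mtn_forward(q, v_q, k_q, W_qm, b_qm, W_qm2, b_qm2):
    # Forward pass of MTN (Eqs 3.2-3.4) with untrained placeholder weights.
    q_m = relu(W_qm @ q + b_qm)            # Eq 3.2: query embedding in the AMC space
    q_prime = relu(W_qm2 @ q + b_qm2)      # query intent used for modality weighting
    c = np.array([cosine(q_prime, v_q),    # Eq 3.3: correlation with each modality
                  cosine(q_prime, k_q)])
    p_v, p_k = softmax(c)                  # Eq 3.4: modality importances
    x_q = p_v * v_q + p_k * k_q            # query-guided multi-modal embedding
    return q_m, x_q, (p_v, p_k)

d_q, d = 300, 80
rng = np.random.default_rng(0)
q = rng.normal(size=d_q)                              # raw query feature
v_q, k_q = rng.normal(size=d), rng.normal(size=d)     # VAN / LAN outputs
W1, b1 = rng.normal(size=(d, d_q)) * 0.01, np.zeros(d)
W2, b2 = rng.normal(size=(d, d_q)) * 0.01, np.zeros(d)
q_m, x_q, weights = mtn_forward(q, v_q, k_q, W1, b1, W2, b2)
print(weights)  # relevance of the (visual, keyword) modalities for this query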
3.2.3 Visual intra-attention network (VAN)
VAN takes query q's input feature q and image x's feature map v as input. It first projects
the image feature map v into a d-dimension raw visual subspace by a 1x1 convolution kernel
W_v ∈ R^{d_v×d}. The projected image feature map is denoted as v' ∈ R^{r×r×d}. Similar
to [11], VAN generates a query-guided kernel s^q from the query embedding vector q through
a non-linear transformation. By convolving the image feature map with s^q, VAN produces
a query-guided attention map M:

    M = σ(s^q * v'),   s^q = f(W_qs q + b_qs)        (3.5)

where f(.) is a non-linear activation function, σ(.) is a softmax function and "*" is the
convolution operator. W_qs, b_qs are the linear transformation matrix and bias vector that
project the query embedding vector q from the language space into the kernel space. The
generated attention map has the same resolution as the image feature map v' (r×r). Each
element in the attention map represents the probability of the corresponding region in
image x being informative conditioned on the intent of query q.

VAN then refines the raw visual subspace by re-weighting each location of the projected
image feature map v' by the corresponding conditional probability in the attention
map M via element-wise product. The query-guided visual embedding vector v^q ∈ R^d
for image x is generated by average pooling of the re-weighted image feature map:

    v^q = AvgPool(M ⊙ v')        (3.6)

where "AvgPool" is the average pooling operation and ⊙ represents element-wise product.
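The sketch below illustrates the VAN computation of Eqs 3.5-3.6 in Python, treating the 1x1 convolution and the query-guided kernel as simple matrix products over the r×r grid. All weights are random placeholders, and the softmax is taken over the r×r spatial locations, which is my reading of σ(.) here:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def spatial_softmax(scores):
    # Softmax over the r x r spatial grid.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def van_forward(v, q, W_v, W_qs, b_qs):
    # Query-guided visual attention (Eqs 3.5-3.6).
    # v   : image feature map, shape (r, r, d_v)
    # q   : raw query feature, shape (d_q,)
    # W_v : 1x1 conv kernel projecting d_v -> d
    r = v.shape[0]
    v_proj = v @ W_v                          # 1x1 conv == per-location projection, (r, r, d)
    s_q = relu(W_qs @ q + b_qs)               # query-guided kernel, shape (d,)
    scores = v_proj @ s_q                     # correlation of the kernel with every location, (r, r)
    M = spatial_softmax(scores)               # attention map over the r x r grid
    v_q = (M[..., None] * v_proj).reshape(r * r, -1).mean(axis=0)  # re-weight + average pool
    return v_q, M

rng = np.random.default_rng(0)
r, d_v, d_q, d = 3, 2048, 300, 80
v = rng.normal(size=(r, r, d_v))
q = rng.normal(size=d_q)
W_v = rng.normal(size=(d_v, d)) * 0.01
W_qs, b_qs = rng.normal(size=(d, d_q)) * 0.01, np.zeros(d)
v_q, M = van_forward(v, q, W_v, W_qs, b_qs)
print(M.shape, v_q.shape)  # (3, 3) attention map, (80,) visual embedding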
3.2.4 Language intra-attention network (LAN)
LAN takes the query input feature vector q and the keyword set feature matrix K as inputs. It
first projects the query q and keywords K into a raw language subspace by linear projections.
Similar to [121], the correlation between the input query and keywords is measured in a
bilinear form:

    s(q, K; W_ql, W_kl, W_l) = (q W_ql) W_l (K W_kl)^T        (3.7)

where W_ql ∈ R^{d_q×d} and W_kl ∈ R^{d_k×d} are transformation matrices that project the query
q and keywords K into the raw subspace, and W_l ∈ R^{d×d} is the bilinear similarity matrix.
Since d < d_q and d < d_k, {W_ql, W_kl, W_l} act like an SVD decomposition of the overall
d_q×d_k bilinear matrix. LAN then refines the raw language subspace by re-weighting
each keyword embedding vector by its probability conditioned on the query's intent.
This probability is measured based on the similarity between query q and keywords K
in Eq 3.7. The refined language embedding k^q ∈ R^d for keyword set K is calculated as

    k^q = p(K|q)^T K W_kl,   p(K|q) = σ(s(q, K))        (3.8)

where s(q, K) is the correlation between query and keywords calculated in Eq 3.7, σ(.)
is the softmax function, and p(K|q) is the probability of each keyword being informative
conditioned on the query's intent.
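A minimal Python sketch of the LAN computation (Eqs 3.7-3.8) follows. The projection and bilinear matrices are random placeholders rather than learned parameters, and the softmax is applied over the n keywords:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lan_forward(q, K, W_ql, W_kl, W_l):
    # Language intra-attention (Eqs 3.7-3.8).
    # q : raw query feature, shape (d_q,)
    # K : keyword feature matrix, shape (n, d_k)
    q_proj = q @ W_ql                   # project the query into the raw language subspace, (d,)
    K_proj = K @ W_kl                   # project the keywords, (n, d)
    s = q_proj @ W_l @ K_proj.T         # Eq 3.7: bilinear similarity with each keyword, (n,)
    p = softmax(s)                      # Eq 3.8: keyword relevance given the query
    k_q = p @ K_proj                    # probability-weighted keyword embedding, (d,)
    return k_q, p

rng = np.random.default_rng(0)
n, d_q, d_k, d = 5, 300, 200, 80
q = rng.normal(size=d_q)
K = rng.normal(size=(n, d_k))
W_ql = rng.normal(size=(d_q, d)) * 0.01
W_kl = rng.normal(size=(d_k, d)) * 0.01
W_l = rng.normal(size=(d, d)) * 0.01
k_q, p = lan_forward(q, K, W_ql, W_kl, W_l)
print(p)  # per-keyword attention weights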
3.2.5 Applications of AMC space
The learned AMC space can be applied directly to two tasks: image search and caption
ranking. For image search, we first calculate the input query q's embedding vector q^m in
the learned AMC space. We then generate the multi-modal representations {x^q} for all
the images in the dataset. The images are ranked based on their relevance to the input
query, which is measured by the cosine distance between q^m and {x^q}.
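For illustration, here is a small sketch of the retrieval step just described: given a query embedding and precomputed query-guided image embeddings (placeholders here, not the actual AMC outputs), images are ranked by cosine similarity in the AMC space:

import numpy as np

def rank_images(q_m, image_embeddings):
    # Rank images by cosine similarity to the query embedding q_m.
    # image_embeddings: array of shape (num_images, d), one query-guided
    # multi-modal embedding x^q per image (assumed precomputed by the AMC model).
    # Returns image indices sorted from most to least relevant, plus the scores.
    q_norm = q_m / (np.linalg.norm(q_m) + 1e-8)
    x_norm = image_embeddings / (np.linalg.norm(image_embeddings, axis=1, keepdims=True) + 1e-8)
    scores = x_norm @ q_norm
    return np.argsort(-scores), scores

rng = np.random.default_rng(0)
q_m = rng.normal(size=80)
images = rng.normal(size=(10, 80))
order, scores = rank_images(q_m, images)
print(order[:5])  # top-5 ranked image indices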
For caption ranking, we adopt another objective function from [56] during training, for
fair comparison:

    L_Θ = Σ_x Σ_k max{0, α − ⟨x^q, q^m⟩ + ⟨x^q, q_k^m⟩}
        + Σ_q Σ_k max{0, α − ⟨x^q, q^m⟩ + ⟨x_k^q, q^m⟩}        (3.9)

where q^m is the caption embedding vector and x^q is the multi-modal embedding vector
of image x. The subscript k indicates negative embeddings for the current caption-image
(keyword) pairs, and ⟨.,.⟩ is the cosine distance measurement. Given a query image x
and its related modalities, we first calculate all candidate captions' embedding vectors q^m
in the learned AMC space. The multi-modal representations of the image conditioned on
each caption's intent {x^q} are then generated by the AMC model. Finally, each caption q
is ranked based on the correlation between q^m and x^q.
We choose the rectified linear unit (ReLU) as the activation function f(.). The AMC
model's parameters consist of the variables {W_v, W_qs, b_qs, W_ql, W'_qm, b'_qm, W_kl, W_l,
W_qm, b_qm}. We apply the Adam [54] algorithm to train the AMC framework in an end-to-end
way.
3.3 Dataset
Keyword datasets. We curated two keyword datasets for Clickture [121] and COCO [66]
using an auto-tagging system; both are available at https://github.com/kanchen-usc/amc_att.
Given a query image, the system first searches for similar images in a commercial image
database using a k-NN ranking algorithm. The query image's keywords are then generated
by a tag-voting program over the keywords associated with the images in the k-NN ranking
results. The Clickture keyword dataset has over 50k unique keywords. The average size of its
keyword sets is 102 (minimum 71, maximum 141). There are over 26k unique keywords in the
COCO keyword dataset; the average size of its keyword sets is 102 (minimum 99, maximum
104). Compared to the object labels in the COCO dataset [66], which cover only 91 object
categories, our keyword dataset is much richer and more diverse. Besides, the keyword dataset
contains multi-word phrases and both upper- and lower-case words, which simulates the
noisy keywords collected from real-world websites (Fig 3.3).
Figure 3.3: Images with keywords in Clickture [121] (left) and COCO image caption
dataset [66] (right). Since each image is associated with ~100 keywords, not all keywords
are listed.
Adobe Stock Dataset (ASD). We collect clickthrough data from the log files of
Adobe Stock (https://stock.adobe.com). ASD contains 1,555,821 images and 1,941,938
queries, which form 3,485,487 {query, image, click} triads. In addition, each image is
associated with a set of keywords with an average size of 53. There are over 27k unique
keywords in ASD. We evaluate AMC models on the image search task on ASD.
Clickture dataset [121] is composed of two parts: the training and development
(dev) sets. The training set contains 23.1M {query, image, click} triplets. The dev set
consists of 79,926 ⟨query, image⟩ pairs generated from 1,000 queries. We evaluate AMC
models on the image search task on Clickture with our keyword dataset.
COCO Image Caption dataset [1] (CIC). COCO image dataset [66] has 82,783
images for training and 413,915 images for validation. CIC shares the same training set
with COCO. The validation set of CIC is composed of 1,000 images sampled from the
COCO validation images, and the test set of CIC consists of 5,000 images sampled from
the COCO validation images which are not in the CIC validation set. Each image in CIC
is associated with 5 candidate captions. Same as [1], we evaluate the AMC model on the
first 1,000 images of the CIC test set for caption ranking, with our curated keywords.
3.4 Experiments
We evaluate our approach on Clickture [46] and the Adobe Stock Dataset (ASD) for the
image search task, and on the COCO Image Caption dataset [66] (CIC) for the caption
ranking task.
3.4.1 Multi-modal image retrieval
Experiment setup. For the visual modality, we divide an input image into 3x3 grids,
and apply a pre-trained 200-layer ResNet [42] to extract an image feature for each grid.
Thus, each image is represented as a 3x3x2048 feature map (r = 3, d_v = 2048). For
models without VAN, we extract global image features and represent each image as a
2048-dimensional (2048D) feature vector. For the query and keyword modalities, we remove
stop words and uncommon words in the raw data, convert all words to lowercase, and
tokenize each word to its index in the corresponding dictionary. The dictionary sizes for
the keyword modality in Clickture and ASD are 50,234 and 27,822. The dictionary sizes for
the query modality in Clickture and ASD are 85,636 and 17,388. We randomly split ASD
into three parts: 70% for training, 10% for validation and 20% for testing.
Compared approaches. We compare the following approaches for performance
evaluation:
(1) Ranking Canonical Correlation Analysis [121] (RCCA) ranks images based on
a bilinear similarity function learned from clickthrough data. We adopt ResNet [42]
features for the RCCA framework, which achieves better performance than the AlexNet [59]
features used in [121].
(2) Multimodal Bilinear pooling (MB) combines the visual and language modalities by
an outer product layer. Compared to the multimodal compact bilinear pooling (MCB)
model [90], we drop the sketch count projection to avoid loss of information from the original
modalities.
(3) Deep structured semantic model [47] (DSSM) learns a subspace to measure the
similarity between the text modality and queries for document retrieval using a deep learning
framework. We build similar structures that take a single image-related modality
for the image search task. Specifically, the image modality (DSSM-Img) and the keyword
modality (DSSM-Key) are evaluated.
Attention networks and AMC models. We compare different attention networks
as follows:
(1) VAN attends on informative regions in the image modality based on the query's
intent.
(2) LAN selects useful words in the keyword modality based on the query's intent.
(3) The late fusion network (LF) first calculates the similarity scores between the input
query and each modality. To represent the final correlation between the query and the
image-related modalities, LF then combines these similarity scores by a linear transformation.
(4) MTN balances the importance of different modalities based on the query's intent.
Approach             | Img | Key | VAN | LAN | LF | MTN
MB [90]              |  X  |  X  |     |     |    |
DSSM-Key [47]        |     |  X  |     |     |    |
DSSM-Img [47]        |  X  |     |     |     |    |
RCCA [121]           |  X  |     |     |     |    |
Img_ATT              |  X  |     |  X  |     |    |
Key_ATT              |     |  X  |     |  X  |    |
Img_ATT-Key_ATT-LF   |  X  |  X  |  X  |  X  |  X |
AMC Full             |  X  |  X  |  X  |  X  |    |  X

Table 3.1: Different models evaluated on Clickture and ASD. The language intra-attention
network (LAN) is applied on the keyword modality. The visual intra-attention network (VAN)
is applied on the image modality. Late fusion (LF) and the multi-modal inter-attention
network (MTN) are applied on multiple modalities.
The different models evaluated on the Clickture dataset and ASD are listed in Table 3.1,
with details on the adopted modalities and attention networks.
Training details. On the Clickture dataset, we sample one negative tuple (t = 1), while
on ASD we sample 3 negative tuples (t = 3). Same as [121], the dimension of the embedding
vectors in all modalities is 80 (d = 80). The batch size is set to 128. We set the margin
α = 1 in Eq 3.1.
Evaluation metrics. For the Clickture dataset, we calculate the NDCG@k score [121] of
the top k ∈ {5, 10, 15, 20, 25} ranking results for an input query. We exclude queries whose
ranking list size is less than k when calculating NDCG@k. The final metric is the average of
all queries' NDCG@k on the Clickture dev set. We further compare different models'
performance under the P@5 (precision at top 5 results), P@k, MAP and MRR metrics,
whose details are described in [78]. ROC curves and Area Under Curve (AUC) are also
compared between different models on the Clickture dataset.
For ASD, we use Recall at k samples (R@k) as the metric. Given a ranked list, R@k is the
recall of positive samples (the ratio of clicked images among all clicked images of the input
query) among the top k results. The final metric is the average of all queries' R@k on the
ASD test set.
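As a concrete reference for the R@k metric used on ASD, here is a small sketch (not the official evaluation script) computing the recall of clicked images among the top-k results for a single query; the image identifiers are illustrative:

def recall_at_k(ranked_image_ids, clicked_image_ids, k):
    # R@k for one query: fraction of clicked images found in the top-k results.
    clicked = set(clicked_image_ids)
    if not clicked:
        return 0.0
    hits = sum(1 for img in ranked_image_ids[:k] if img in clicked)
    return hits / len(clicked)

# Toy example: 3 clicked images, 2 of them appear in the top-5 ranking.
ranking = ["img7", "img2", "img9", "img4", "img1", "img3"]
clicked = ["img2", "img1", "img8"]
print(recall_at_k(ranking, clicked, k=5))  # 2/3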
Performance on Clickture. The performance of different models on the Clickture
dataset is shown in Tables 3.2, 3.3 and Fig 3.4. We first apply intra-attention networks
on single-modality models, which filters out unrelated information within each modality
according to the query's intent.
Approach             | NDCG@5 | NDCG@10 | NDCG@15 | NDCG@20 | NDCG@25
MB                   | 0.5643 | 0.5755  | 0.5873  | 0.5918  | 0.5991
DSSM-Key             | 0.5715 | 0.5745  | 0.5797  | 0.5807  | 0.5823
DSSM-Img             | 0.6005 | 0.6081  | 0.6189  | 0.6192  | 0.6239
RCCA                 | 0.6076 | 0.6190  | 0.6293  | 0.6300  | 0.6324
Key_ATT              | 0.5960 | 0.6054  | 0.6168  | 0.6204  | 0.6241
Img_ATT              | 0.6168 | 0.6233  | 0.6308  | 0.6350  | 0.6401
Img_ATT-Key_ATT-LF   | 0.6232 | 0.6254  | 0.6344  | 0.6376  | 0.6444
AMC Full             | 0.6325 | 0.6353  | 0.6431  | 0.6427  | 0.6467

Table 3.2: Performance of different models on the Clickture dataset. The evaluation metrics
are NDCG@5, 10, 15, 20, 25 (corresponding to the 2nd to 6th columns). For
k ∈ {5, 10, 15, 20, 25}, we exclude queries with ranking list size less than k when calculating
NDCG@k.
Approach  | P@5    | P@k    | MAP    | MRR    | AUC
MB        | 0.5615 | 0.6372 | 0.7185 | 0.7564 | 0.6275
DSSM-Key  | 0.5431 | 0.6756 | 0.6969 | 0.7884 | 0.5508
DSSM-Img  | 0.5835 | 0.6705 | 0.7308 | 0.7773 | 0.6455
RCCA      | 0.5856 | 0.6778 | 0.7332 | 0.7894 | 0.6384
AMC Full  | 0.6050 | 0.7069 | 0.7407 | 0.8067 | 0.6727

Table 3.3: Models' performance under different metrics.
The resulting models, Key_ATT and Img_ATT, achieve 2.2% and 2.6% increases in NDCG@5
compared to DSSM-Key and DSSM-Img, respectively. The attention-guided single-modality
model Img_ATT even beats the two-modality MB model on the NDCG metric. We further
apply the late fusion network (LF) on the two attention-guided modalities. The resulting
model Img_ATT-Key_ATT-LF achieves an additional 1% increase in NDCG@5 compared to
Img_ATT and Key_ATT, which validates the effectiveness of learning a multi-modal subspace
to further boost the image search task. Finally, we apply MTN to select informative
modalities based on the query's intent. The AMC full model achieves state-of-the-art
performance on the NDCG metric, with more than a 3% increase over the single-modality
models and a 2.5% increase in NDCG@5 compared to the RCCA model [121], which is ~3
times RCCA's increase over the previous state-of-the-art method.
We further evaluate AMC models under different metrics. In Table 3.3, the AMC Full
model achieves clear increases under all metrics. We show the ROC curves in Fig 3.4.
The AUC of the AMC Full model shows an increase of 3.4% compared to the state-of-the-art
method, which demonstrates the effectiveness of the AMC learning method. Some
visualization results are shown in Fig 3.5.

Figure 3.4: ROC curves for different models (RCCA, DSSM-Img, DSSM-Key, MB, AMC,
and a random baseline).

Approach             | R@1    | R@5    | R@10   | R@15   | R@20
DSSM-Img             | 0.0767 | 0.2778 | 0.4025 | 0.4617 | 0.4891
DSSM-Key             | 0.0980 | 0.3076 | 0.4207 | 0.4700 | 0.4926
Img_ATT              | 0.0782 | 0.2793 | 0.4049 | 0.4642 | 0.4918
Key_ATT              | 0.1042 | 0.3187 | 0.4322 | 0.4803 | 0.5019
Img_ATT-Key_ATT-LF   | 0.1106 | 0.3445 | 0.4620 | 0.5108 | 0.5327
AMC Full             | 0.1168 | 0.3504 | 0.4673 | 0.5148 | 0.5414

Table 3.4: Performance of different models on ASD. The evaluation metrics are R@1, 5,
10, 15, 20 (corresponding to the 2nd to 6th columns).
Performance on ASD. We observe similar improvements from applying the different
attention mechanisms in the AMC models in Table 3.4. For the intra-attention networks,
LAN (Key_ATT) achieves a 0.6-1.2% increase over DSSM-Key in R@k scores, while VAN
(Img_ATT) does not yield much improvement (~0.2% increase in R@k scores). This
is because most images in ASD contain only one object in the center, which takes ~70%
of the space with a clean background. In such cases, VAN can offer only a limited boost in
performance by focusing on informative regions. We then combine VAN and LAN using
LF. The resulting model, Img_ATT-Key_ATT-LF, achieves significant improvement in R@k
scores, with a 1.2-3.8% increase compared to DSSM-Key and a 3.2-6.5% increase compared
to DSSM-Img.
Figure 3.5: Visualization of AMC model's VAN, LAN and MTN results. First column:
Input query and importance of visual and language modalities produced by MTN. Second
and third columns: original images and query-guided attention maps produced by VAN.
Fourth column: Some keywords highlighted by LAN.
We further apply MTN to attend over the different modalities, obtaining the AMC Full
model. The AMC Full model achieves the best performance, with a 0.6-1.0% increase in R@k
scores compared to the late fusion model, a 1.8-4.9% increase compared to DSSM-Key and a
3.8-7.1% increase compared to DSSM-Img.
Overfitting. During the training stage, we evaluate AMC models on the test set every epoch.
The training loss first decreases and converges at around epoch 12. The loss on the test set
follows a similar trend and converges at around epoch 14 on both Clickture and ASD,
which indicates a low possibility of overfitting. We further apply AMC models to the caption
ranking task, where they also achieve competitive performance.
3.4.2 Caption ranking
Experiment Setup. For the visual modality, we apply a pre-trained 200-layer ResNet [42] to
extract image features as input. Each image is represented as a 2048D feature vector. To
compare with [68], we also extract image features using a pre-trained 19-layer VGG [105]
network (4096D feature vector). For the auto-tagged keywords, we remove stop words and
uncommon words in the raw data, convert all words to lowercase, and tokenize each
word to its index in the corresponding dictionary. The dictionary size for the keyword
modality is 26,806. For the caption modality, we extract skip-thought vectors [56] using a
pre-trained model. Each caption is represented by a 4800D skip-thought vector. Same
as [56], the embedding vectors in all modalities are projected to 1000 dimensions (d = 1000).
The similarity between the query and the features from different modalities is measured by
the cosine distance in the AMC space.

Approach         | VGG | Res | LF | MTN
Skip-Vgg [56]    |  X  |     |    |
Skip-Vgg-Key-LF  |  X  |     | X  |
AMC-Vgg          |  X  |     |    |  X
Skip-Res         |     |  X  |    |
Skip-Res-Key-LF  |     |  X  | X  |
AMC-Res          |     |  X  |    |  X

Table 3.5: Different models evaluated on CIC. Late fusion (LF) and inter-attention
(MTN) networks are applied on multiple modalities. The caption modality is represented by
Skip-thought vectors (Skip). The image modality is represented by either VGG features (VGG)
or ResNet features (Res).
AMC models. Using the same notation as Sec 3.4.1, we apply late fusion (LF) and
inter-attention (MTN) mechanisms to combine features from the image modality and the
keyword modality (Key). The configurations of the different AMC models are shown in Table 3.5.
Training details. We set the margin α = 0.2 and the number of negative samples k = 50
for each correct caption-image (keyword) pair (Eq 3.9).
Evaluation Metric. We follow the evaluation metric reported in [1]. Same as
[1, 55, 56, 57, 73, 77], we report the caption retrieval performance on the first 1,000 test
images. For a test image, the caption retrieval system needs to find any 1 out of its 5
candidate captions from all 5,000 test captions. We report recall@(1, 5, 10), which is the
fraction of times a correct caption is found among the top (1, 5, 10) ranking results.
Performance comparison. AMC models provide very competitive results even
without a complex language model, e.g., a recurrent neural network (RNN), convolutional
neural network (CNN) or Gaussian mixture model (GMM), to process captions, in contrast
to the models in [1, 55, 56, 57, 73, 77]. In Table 3.6, we first combine the keyword and image
modalities using late fusion (Skip-Vgg-Key-LF). Skip-Vgg-Key-LF gives a small improvement
in performance of ~0.6% in R@(1, 5, 10). This indicates that the keyword modality
provides useful information, but further care is needed to put it to better use. Thus,
we apply the inter-attention network (AMC-Vgg) to select informative modalities, which
boosts the performance by a large margin, with 3.5%, 1.9% and 1.5% increases in R@(1, 5,
10), respectively. We further change the image features to ResNet features and observe
similar performance improvements as with the VGG features. The final model (AMC-Res),
which applies MTN on the ResNet-based image modality and the keyword modality, achieves
performance very close to [73] on R@1 and to [55] on R@5, and even surpasses the
state-of-the-art result on R@10. We notice that the AMC model does not achieve better
results in R@5 compared to [77, 73, 55]. This is because we adopt a relatively simple language
model (skip-thought vectors [56]) for captions, with a base performance of 33.5% in R@1.
Equipped with a more complex RNN / CNN model to process the caption modality, AMC
models can be expected to improve further.

Approach          | R@1  | R@5  | R@10
Random            | 0.1  | 0.5  | 1.0
DVSA [1]          | 38.4 | 69.9 | 80.5
FV [57]           | 39.4 | 67.9 | 80.5
m-RNN-vgg [77]    | 41.0 | 73.0 | 83.5
m-CNN_ENS [73]    | 42.8 | 73.1 | 84.1
Kiros et al. [55] | 43.4 | 75.7 | 85.8
Skip-Vgg [56]     | 33.5 | 68.6 | 81.5
Skip-Vgg-Key-LF   | 34.2 | 69.3 | 82.0
AMC-Vgg           | 37.0 | 70.5 | 83.0
Skip-Res          | 39.5 | 73.6 | 86.1
Skip-Res-Key-LF   | 40.1 | 74.2 | 86.5
AMC-Res           | 41.4 | 75.1 | 87.8

Table 3.6: Performance of different models on CIC. The evaluation metrics are R@1, 5,
10 (corresponding to the 2nd to 4th columns). AMC models achieve competitive performance
among all VQA-agnostic models using only skip-thought vectors for the caption modality.
We notice that [68] reports much better results on the caption ranking task than
[1, 55, 56, 57, 73, 77]. However, the model in [68] is a "VQA-aware" model, which
encodes external VQA knowledge learned from the VQA task and fuses it with the model
in [55]. AMC models, as well as the models in [1, 55, 56, 57, 73, 77], are "VQA-agnostic"
models, which can be fused with and enhanced by external VQA knowledge. We expect a
further boost in the performance of AMC models on the caption ranking task when the VQA
knowledge data is made public.
Chapter 4
Mid-Level Multimodal Reasoning for Phrase Grounding
(Part I)
4.1 Introduction
Given an image and a natural language phrase as a query, phrase grounding attempts
to localize the mentioned objects in the image according to the query's specification. It
can be utilized in many daily-life applications, such as electronic entertainment, early
education, security surveillance, etc. A solution to this problem can be an important
building block for image-language related tasks, such as image captioning [51, 1, 20], visual
question answering [2, 11, 21] and image retrieval [37, 94].
Phrase grounding is a challenging problem that involves reasoning over language queries
and transferring their semantics to localize objects in visual content. To address the problem,
typically a set of proposal bounding boxes is first generated as candidates by some proposal
generation system. The main difficulties lie in how to correlate the language input and
the proposals' features, and how to localize objects after learning such multimodal
correlation. State-of-the-art methods address the first difficulty by treating phrase grounding
as a ranking problem: they learn a multimodal subspace where the relevance between visual
and language inputs is measurable, and then rank proposals according to their relevance to
the query's specification. Among them, the Phrase-Region CCA [93] and SCRC [45] models
learn a subspace using Canonical Correlation Analysis (CCA) and a Recurrent Neural Network
(RNN), respectively. GroundeR [99] adopts an attention network, which learns a latent
subspace to attend on related proposals given different queries via phrase reconstruction.
These methods are suboptimal due to two problems. First, when the proposal generation
system fails to provide good proposals which overlap the object mentioned by the
query with a large region (Fig. 4.1), these proposal-based models are unable to localize
the correct object. As a result, there exists a performance upper bound imposed by the
proposal generation systems. Second, these grounding systems consider query phrases as
unrelated to each other. However, different queries for the same image are sometimes
semantically related. For example, we observe that query phrases for the same image are
usually correlated within the same image-related captions in the Flickr30K Entities [93]
dataset. Intuitively, given a query phrase, other phrases from the same sentence, which form
the query's context, can provide useful cues for grounding the correct objects. As shown in
Fig. 4.1, with the context "the tree", the grounding system should be able to infer that the
current query "a man" does not refer to the tree object in the image, even though the tree
proposal also has high confidence; and the context "guitar" can provide hints for the system
to find "a man" near the object "guitar".
Figure 4.1: The Multimodal Spatial Regression with semantic Context (MSRC) system
regresses each proposal based on the query's semantics and visual features. Besides, MSRC
takes advantage of context cues to filter out confusing candidates and refine the regression
results. (Each regression box's ID corresponds to a proposal box's ID, with its confidence
shown in the top-left corner.)
To address the aforementioned two issues, we propose a Multimodal Spatial Regression
with semantic Context (MSRC) deep network. The MSRC system is composed of two parts:
a Spatial Regression Network (SRN) and a Context Refinement Network (CRN). SRN
applies a bi-directional LSTM to encode the query's intent and takes each proposal's feature
as visual information. It learns to predict the probability of each proposal being related to
the query, and regresses each proposal to the mentioned object's location via a joint projection
into a multimodal subspace. SRN is robust to the performance of the proposal generation
program, because when a proposal does not overlap much with the object mentioned in the
query, SRN can regress the proposal to best fit the mentioned object. CRN takes a pair of
queries from the same sentence and jointly predicts the proposals' probabilities of being
relevant to each query. Based on the assumption that different queries in the same sentence
refer to different objects, CRN adopts a joint prediction loss, which helps filter out confusing
proposal candidates. The final prediction of MSRC is obtained by a late fusion of the SRN
and CRN decisions, which takes advantage of both the spatial regression results and the
context information.
We evaluate the MSRC system on two popular phrase grounding datasets: Flickr30K
Entities [93] and Refer-it Game [52] datasets. Flickr30K Entities contains more than 30K
images associated with 5 captions for each image. There are 244K query phrases referring
to 276K manually annotated bounding boxes of objects in images. Every query phrase
comes from some image related caption. Refer-it Game has 130K query phrases, referring
to 96K objects, which are annotated for 19K images of natural scenes. We adopt the ratio
of phrases that are successfully grounded by MSRC as the evaluation metric. Experiments
show that the MSRC system achieves more than 6% improvement on Flickr30K Entities and
5% improvement on Refer-it Game, indicating the effectiveness of our approach.
Our contributions are twofold: First, we propose a spatial regression approach in a
multimodal space, which relieves the performance limitation imposed by proposal generation
systems. Second, we encode context information with the query phrase by adopting a joint
prediction loss during the training stage, which helps filter out confusing candidates during
grounding.
Figure 4.2: Structure of the MSRC system. (a) An example image and query phrases: For
the query "A woman" (blue text), the queries in red text are considered as its context, which
are further utilized by CRN. The input image is represented as a set of proposal bounding
boxes (green), and the ground truth for the query is the red box. (b) Structure of SRN:
SRN takes proposals and a query phrase as inputs. Multimodal features are encoded by a
Multimodal Neural Network (MNN). SRN predicts each proposal's probability of being
related to the query as well as regression parameters to localize the mentioned object. (c)
Framework of MSRC: An SRN is first trained and then utilized to finetune CRN. CRN
refines the probabilities predicted by SRN by encoding context information. (d) Structure of
CRN: Each (language, proposal set) pair has an SRN to predict confidence. All SRNs share
weights during training. We propose a joint prediction loss to encode context information.
4.2 MSRC System
The Multimodal Spatial Regression with semantic Context (MSRC) system contains two parts:
a Spatial Regression Network (SRN) and a Context Refinement Network (CRN). We first
introduce the framework of MSRC, followed by the structures of SRN and CRN. We then
provide more details about the training and grounding stages of the MSRC system.
4.2.1 Framework
The visual input of SRN and CRN is a set of N proposal bounding boxes {r_i} generated
from an input image I. Each proposal r_i is represented by a visual feature x_i ∈ R^{d_v}
extracted by a pre-trained Convolutional Neural Network (CNN), where d_v is the visual
feature's dimension. The language input of SRN is a query phrase q, while CRN takes
q and its context phrases {p_j^q} parsed from the same sentence as inputs. We adopt a
bi-directional LSTM [44] to encode the semantics of the query q and its context {p_j^q}, which
are denoted as q ∈ R^{d_l} and p_j^q ∈ R^{d_l} respectively, where d_l is the language embedding
vector's dimension.

Given {x_i} and q, SRN predicts a probability distribution over the N proposals as
well as each proposal's regression parameters to infer the location of the object mentioned by
the query phrase q. The objective is defined as:

    argmin_{Θ_s} Σ_q [ L^s_cls({x_i}, q) + λ L^s_reg({x_i}, q) ]        (4.1)

where Θ_s denotes the parameters of SRN. L^s_cls is a cross-entropy function adopted for the
multiclass classification task. L^s_reg is a regression loss function, weighted by a
hyperparameter λ, whose details are in Sec. 4.2.2.

To further refine the grounding results, we propose a CRN to generate a more accurate
probability distribution over the N proposals by encoding context information. CRN is
based on an assumption: different phrases in one sentence refer to different objects in one
image, which is always true in the Flickr30K Entities dataset [93]. Given the N proposals'
visual features {x_i}, the language features of the query q and the M context phrases {p_j^q},
the objective of CRN is defined as:

    argmin_{Θ_c} Σ_q [ L^c_cls + λ L^c_reg + μ J({x_i}, q, {p_j^q}) ]        (4.2)

where Θ_c denotes the parameters of CRN and λ, μ are hyperparameters. L^c_cls and L^c_reg are
similar to L^s_cls and L^s_reg in Eq. (4.1). We propose a novel joint prediction loss function
J, penalizing prediction results that match the semantics of the context phrases {p_j^q}. With
this term, CRN maximizes the margin between proposals of different queries in the same
sentence.
We first train SRN and then finetune CRN based on the trained weights of SRN.
During evaluation, we fuse the predicted probabilities from both SRN and CRN, and
regress the proposal with the maximum probability according to the regression parameters
predicted by SRN, with more details in Sec. 4.2.4.
4.2.2 Spatial Regression Network (SRN)
As shown in Fig. 4.2, SRN concatenates the language embedding vector q with each
proposal's visual feature x_i. It then applies a network to generate multimodal features
{v_i^q} ∈ R^m for each ⟨q, r_i⟩ pair in an m-dimensional subspace (we term this network
the MNN). The multimodal feature v_i^q is calculated as:

    v_i^q = φ(W_m (q || x_i) + b_m)        (4.3)

where W_m ∈ R^{(d_l+d_v)×m} and b_m ∈ R^m are the projection parameters of the MNN, φ(.) is a
non-linear activation function, and "||" denotes the concatenation operator. We also replace the
MNN with a Multimodal Compact Bilinear pooling layer from [21] to evaluate the performance
of different multimodal features, with more details discussed in Sec. 4.3.
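A brief Python sketch of the MNN fusion of Eq. 4.3 follows; the weights are random placeholders and the function name is illustrative, not taken from the MSRC implementation:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mnn_features(q, X, W_m, b_m):
    # Multimodal features of Eq. 4.3 for all proposals at once.
    # q : query embedding from the bi-directional LSTM, shape (d_l,)
    # X : proposal visual features, shape (N, d_v)
    # Returns an (N, m) matrix of multimodal features v_i^q.
    Q = np.tile(q, (X.shape[0], 1))          # repeat the query for every proposal
    concat = np.concatenate([Q, X], axis=1)  # (N, d_l + d_v), the "q || x_i" concatenation
    return relu(concat @ W_m + b_m)

rng = np.random.default_rng(0)
N, d_l, d_v, m = 4, 1000, 4101, 128
q = rng.normal(size=d_l)
X = rng.normal(size=(N, d_v))
W_m = rng.normal(size=(d_l + d_v, m)) * 0.01
b_m = np.zeros(m)
print(mnn_features(q, X, W_m, b_m).shape)  # (4, 128)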
Given the multimodal feature v_i^q, SRN predicts a 5D vector s_i^p ∈ R^5 for each proposal
r_i via a linear projection (the superscript "p" denotes prediction):

    s_i^p = W_s v_i^q + b_s        (4.4)

where W_s ∈ R^{m×5} and b_s ∈ R^5 are the projection weight and bias to be optimized. The first
element of s_i^p indicates the confidence of r_i being related to the input query q's semantics.
We denote by {s'_i^p} the probability distribution over {r_i} obtained by feeding {s_i^p[0]} to a
softmax function. During training, we choose as the positive proposal the one which
overlaps most with the ground truth and has Intersection over Union (IoU) > 0.5. Thus, the
classification loss is calculated as:

    L^s_cls({x_i}, q) = − s'_{i*}^p[0] log(s'_{i*}^p[0])        (4.5)

where i* is the positive proposal's index in the proposal set.
The next four elements of s_i^p encode the regression information relative to the current
location of r_i, which are defined as:

    s_i^p[1] = (x_pred − x_{r_i}) / w_{r_i}
    s_i^p[2] = (y_pred − y_{r_i}) / h_{r_i}
    s_i^p[3] = log(w_pred / w_{r_i})
    s_i^p[4] = log(h_pred / h_{r_i})        (4.6)

where [x_pred, y_pred, w_pred, h_pred] are the predicted regressed bounding box's center x and y
coordinates, width and height. Similarly, [x_{r_i}, y_{r_i}, w_{r_i}, h_{r_i}] is the location
information of r_i.
Each proposal's ground truth regression data s_i^q ∈ R^4 is calculated in the same way
as Eq. (4.6), by replacing [x_pred, y_pred, w_pred, h_pred] with the ground truth bounding box's
location information. The regression loss for SRN is:

    L^s_reg({x_i}, q) = (1 / 4N) Σ_{i=1}^{N} Σ_{j=0}^{3} f(|s_i^p[j+1] − s_i^q[j]|)        (4.7)

where f(.) is the smooth L1 loss function: f(x) = 0.5x^2 for x < 1, and f(x) = |x| − 0.5 for
x ≥ 1.
4.2.3 Context Renement Network (CRN)
CRN is built on an assumption: different query phrases in one sentence usually refer to
different objects in the same image, which is common in the Flickr30K Entities dataset [93].
Based on this assumption, CRN penalizes prediction results that match other queries'
semantics during training. In this way, CRN maximizes the margin of probability between
different queries from the same sentence.

CRN takes the proposals' visual features {x_i}, the embedding vector q of query q, and the M
context phrases' embedding vectors {p_j^q} as inputs. As shown in Fig. 4.2, for each query
or context phrase, CRN builds an SRN-like structure to process the language embedding
and visual features; these structures share weights with each other during training. We denote
the output of SRN for the pair ⟨q, {x_i}⟩ as {t_i^q} and for the pair ⟨p_j, {x_i}⟩ as {t_i^{p_j}}.
Besides, we denote by S_q the set of proposals which overlap with the ground truth with
IoU > 0.5. The cross-entropy loss and regression loss in Eq. (4.2) are calculated in the
same way as Eq. (4.5) and Eq. (4.7), while the last term in Eq. (4.2) is calculated as:

    J({x_i}, q, {p_j^q}) = − Σ_{j=1}^{M} Σ_{i=1}^{N} 1(i ∈ S_q) t'_i^{p_j}[0] log(t'_i^{p_j}[0])        (4.8)

where t'_i^{p_j}[0] is the softmax-normalized probability generated from {t_i^{p_j}[0]}, and 1(.) is
an indicator function that judges whether the current label matches query q's semantics. If it
does, CRN should penalize this result, because the proposal should belong to the context
phrase rather than the input query.
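To make the joint prediction term more tangible, the sketch below evaluates Eq. (4.8) as reconstructed above for one query: it sums a penalty over proposals that belong to the current query (i ∈ S_q) whenever a context phrase places probability mass on them. The variable names are illustrative and the code is a rough sketch, not the MSRC training code:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def joint_prediction_loss(context_scores, positive_set):
    # Joint prediction loss J of Eq. (4.8), as reconstructed in the text.
    # context_scores : list of M arrays, each of shape (N,), raw confidences that
    #                  each context phrase assigns to the N proposals (t_i^{p_j}[0]).
    # positive_set   : indices of proposals in S_q, i.e. proposals with IoU > 0.5
    #                  with the current query's ground truth.
    loss = 0.0
    for scores in context_scores:
        probs = softmax(scores)                      # softmax-normalized t'_i^{p_j}[0]
        for i in positive_set:
            loss += -probs[i] * np.log(probs[i] + 1e-8)
    return loss

# Toy example: 2 context phrases, 5 proposals, proposals 0 and 3 belong to the query.
rng = np.random.default_rng(0)
context_scores = [rng.normal(size=5), rng.normal(size=5)]
print(joint_prediction_loss(context_scores, positive_set=[0, 3]))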
4.2.4 Training & Phrase grounding of MSRC
The number of context training samples is much smaller than that of single query phrases.
To compensate for this, we first train the SRN using single queries with image proposals,
and then finetune the CRN by introducing context training samples, initializing the CRN
structure from the pre-trained SRN. We use the Adam algorithm [54] to optimize the
deep learning framework, and adopt the rectified linear unit (ReLU) as the non-linear
activation function.
During the phrase grounding stage, if an input query q has T_q context phrases in the
same sentence, SRN first predicts a probability distribution {Pr_srn(i)} over the proposals
as well as each proposal's regression parameters {Reg_srn(i)}. Then CRN receives triplets
⟨q, {r_i}, p_j^q⟩ successively and predicts a probability distribution over the proposals
{Pr_crn(i | p_j^q)}. The MSRC system gives the final prediction result as:

    Box* = Regression(Reg_srn(i*), r_{i*}),  where
    i* = argmax_i ( Pr_srn(i) + (γ / T_q) Σ_{j=1}^{T_q} Pr_crn(i | p_j^q) )        (4.9)

γ is a hyperparameter. Regression(.) denotes the regression operation based on the
input bounding box and regression parameters.
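The grounding-stage fusion of Eq. (4.9) can be summarized by the following sketch; Pr_srn, Pr_crn and Reg_srn are assumed to be precomputed network outputs (placeholders here), and the box decoding simply inverts the parameterization of Eq. (4.6):

import numpy as np

def decode_box(proposal, reg):
    # Invert Eq. (4.6): apply predicted regression parameters to a proposal box.
    px, py, pw, ph = proposal
    return np.array([px + reg[0] * pw,        # new center x
                     py + reg[1] * ph,        # new center y
                     pw * np.exp(reg[2]),     # new width
                     ph * np.exp(reg[3])])    # new height

def msrc_predict(pr_srn, pr_crn, reg_srn, proposals, gamma=1.0):
    # Fuse SRN and CRN probabilities (Eq. 4.9) and regress the chosen proposal.
    # pr_srn    : (N,) SRN probabilities
    # pr_crn    : (T_q, N) CRN probabilities, one row per context phrase
    # reg_srn   : (N, 4) regression parameters predicted by SRN
    # proposals : (N, 4) proposal boxes as (cx, cy, w, h)
    T_q = pr_crn.shape[0]
    fused = pr_srn + (gamma / T_q) * pr_crn.sum(axis=0)
    i_star = int(np.argmax(fused))
    return decode_box(proposals[i_star], reg_srn[i_star]), i_star

rng = np.random.default_rng(0)
pr_srn = np.array([0.1, 0.5, 0.3, 0.1])
pr_crn = np.array([[0.4, 0.2, 0.2, 0.2], [0.3, 0.1, 0.5, 0.1]])
reg_srn = rng.normal(scale=0.1, size=(4, 4))
proposals = rng.uniform(20, 100, size=(4, 4))
box, idx = msrc_predict(pr_srn, pr_crn, reg_srn, proposals, gamma=1.0)
print(idx, box)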
4.3 Experiments
We evaluate the MSRC system on the Flickr30K Entities [93] and Refer-it Game [52]
datasets for phrase grounding.
4.3.1 Datasets
Flickr30K Entities [93] contains 31,783 images, with 29,783, 1,000 and 1,000 images for
training, validation and testing respectively. Each image is associated with 5 captions.
There are 559,767 query phrases extracted from these captions referring to 276K manually
annotated bounding boxes in images. The vocabulary size of queries is 17,150. The
maximum length of query phrases is 19 words.
Refer-it Game [52] contains 19,894 images of natural scenes. There are 96,654 distinct
objects in these images. Each of them is referred to by 1-3 query phrases (130,525 in
total). The vocabulary size of queries is 8,800, and the maximum length of query phrases
is 19 words.
4.3.2 Experiment Setup
Proposal generation. We choose Selective Search [109] to generate proposals for
Flickr30K Entities and Edge Boxes [129] to generate proposals for Refer-it Game, which
are the same settings as in GroundeR [99] for fair comparison.
Visual feature extraction. We choose a VGG network [105] pre-trained on
ImageNet [17] to extract each proposal bounding box's visual feature, which is denoted as
"VGG_cls", for both the Flickr30K Entities and Refer-it Game datasets. Besides, we apply
a VGG network finetuned by Fast R-CNN [33] on the PASCAL VOC 2007 [19] dataset to
extract visual features for Flickr30K Entities, which are denoted as "VGG_det".
To predict regression parameters, we need to include spatial information for each
proposal. For Flickr30K Entities, we augment each proposal's visual feature with its
spatial information [x_tl/W, y_tl/H, x_br/W, y_br/H, wh/WH] as defined in [122]. We denote
these augmented features as "VGG_cls-SPAT1" and "VGG_det-SPAT1" for "VGG_cls" and
"VGG_det" respectively, each being a 4101D (d_v = 4101) vector. For the Refer-it Game dataset,
we augment each proposal's visual feature with its spatial information [x_min, y_min, x_max,
y_max, x_center, y_center, w_box, h_box] as defined in [45]. We denote the augmented visual
features as "VGG_cls-SPAT2", which are 4104D (d_v = 4104) vectors.
Language encoding. We encode the query's information with a bi-directional LSTM [44].
We choose the last hidden state of the LSTM as the output q (dimension d_l = 1000), which
is the same as in [99].
Model initialization. We initialize the convolution layers with the MSRA method [41]
and the fully-connected layers with the Xavier method [34]. We introduce batch
normalization layers after projecting the visual and language features, which is the
same setting as in [99].
Metric. We adopt accuracy as the evaluation metric, defined as the ratio of
phrases for which the regressed box overlaps with the ground-truth box by more than
50% IoU.
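For reference, here is a minimal sketch of the IoU check behind this accuracy metric (boxes in (x_min, y_min, x_max, y_max) format; this is illustrative code, not the official evaluation script):

def iou(box_a, box_b):
    # Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_accuracy(predicted_boxes, ground_truth_boxes, threshold=0.5):
    # Ratio of phrases whose predicted box overlaps the ground truth by > 50% IoU.
    hits = sum(1 for p, g in zip(predicted_boxes, ground_truth_boxes)
               if iou(p, g) > threshold)
    return hits / len(ground_truth_boxes)

preds = [(10, 10, 60, 60), (0, 0, 20, 20)]
gts = [(12, 8, 58, 62), (40, 40, 80, 80)]
print(grounding_accuracy(preds, gts))  # 0.5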
Compared approaches. We choose GroundeR [99], CCA embedding [93], MCB [21]
and the approach proposed by Wang et al. [113] for comparison, all of which achieve
state-of-the-art performance on the phrase grounding problem. For GroundeR [99], we focus
on its supervised learning scenario, which achieves the best performance among its different
scenarios.
4.3.3 Performance on Flickr30K Entities
SRN model. During training, we set the Multimodal Neural Network (MNN) output
dimension to 128 (m = 128), which is the same as [99]. Using the VGG_cls-SPAT1 features,
we achieve 51.18% accuracy. Compared to GroundeR (VGG_cls), SRN achieves a 9.62%
increase in accuracy. Compared to VGG_cls, VGG_det focuses on the object detection task,
which is more suitable for object localization. Using the VGG_det feature for each
proposal, we further improve the performance to 55.99%. We also substitute the MNN
with a Multimodal Compact Bilinear pooling (MCB) layer in SRN, which is the model
MCB+Reg in Table 4.1. Experiments show MCB+Reg has a 2.32% increase compared to
the MCB [21] model.
We test different output dimensions of MNN, which is discussed in a later section
(Table 4.3). Experiments show that MNN+Reg achieves the best performance among models
taking a single query phrase as language input, with an 8.29% improvement compared to
GroundeR [99], and 5.1% over the state-of-the-art approach [93].
CRN model. We finetune CRN based on the MNN+Reg SRN, and take VGG_det-SPAT1
as input. In the training and testing stages, we treat the other query phrases from the
same caption as the input query phrase as context. We set the input number of context
phrases for CRN to 1 (M = 1) and the weight of the joint prediction loss μ = 1.0. This
yields a slight improvement (0.32%) in accuracy.
MSRC System. For the MSRC model's prediction, we fuse CRN's probabilities with
SRN's probabilities, select the proposal with the maximum probability, and then regress that
proposal according to the regression parameters predicted by SRN. We set the weight
γ = 1.0 in Eq. (4.9). The MSRC Full model achieves the best performance on Flickr30K
Entities (57.53%), a 6.64% increase compared to [93].
Since Flickr30K Entities provides the phrase type for each query, we further compare
the detailed phrase localization results. In Table 4.2, we observe similar boosts in performance
from adopting SRN, CRN and the MSRC fusion as in Table 4.1. However, different models
have different strengths.
Approach                          | Accuracy (%)
Compared approaches:
SCRC [45]                         | 27.80
Wang et al. [113]                 | 42.08
GroundeR (VGG_cls) [99]           | 41.56
GroundeR (VGG_det) [99]           | 47.70
MCB [21]                          | 48.69
CCA embedding [93]                | 50.89
Spatial regression models:
MCB+Reg (VGG_det-SPAT1)           | 51.01
MNN+Reg (VGG_cls-SPAT1)           | 51.18
MNN+Reg (VGG_det-SPAT1)           | 55.99
Context models:
CRN (MNN+Reg (VGG_det-SPAT1))     | 56.31
MSRC Full                         | 57.53

Table 4.1: Different models' performance on Flickr30K Entities. CRN is finetuned based
on MNN with a regression layer and takes VGG_det-SPAT1 as input visual features.
The CCA embedding [93] model is strong at localizing "instruments" while GroundeR [99] is
better at localizing "scenes". Using SRN, we observe that the regression network achieves an
increase in accuracy compared to the GroundeR model (VGG_det). In particular, there is a
large increase in the performance of localizing "animals" and "body parts" (increases of ~24%
and ~13%, respectively). Using CRN, we observe that the increase for "scene" is the largest.
In the final fusion stage, the MSRC Full model achieves more than 6.5%, 5.88% and 2.9%
increases in accuracy in all categories (except "instruments" for CCA embedding [93])
compared to GroundeR [99], Wang et al. [113] and CCA embedding [93], respectively.
Dimension of multimodal subspace in SRN. To find the relation between SRN's
performance and the multimodal subspace's dimension, we train and test SRN (MNN+Reg)
with five different multimodal subspace dimensions, m = 64, 128, 256, 512, 1024. The
performances are recorded in Table 4.3. From the results, we observe that SRN has lower
performance when the multimodal subspace has a smaller dimension, which may be caused
by a lack of trainable parameters to exhibit the model's expressive power. When the
multimodal subspace has larger dimensions, the performance fluctuates on a small scale,
which reflects that SRN is insensitive to the multimodal subspace's dimension when m is
large.
Phrase Type                       | people | clothing | body parts | animals
GroundeR (VGG_cls) [99]           | 53.80  | 34.04    | 7.27       | 49.23
Wang et al. [113]                 | 57.89  | 34.61    | 15.87      | 55.98
CCA embedding [93]                | 64.73  | 46.88    | 17.21      | 65.83
SRN: MCB+Reg (VGG_det-SPAT1)      | 62.75  | 43.67    | 14.91      | 65.44
SRN: MNN+Reg (VGG_det-SPAT1)      | 67.38  | 47.57    | 20.11      | 73.75
CRN: MNN+Reg (VGG_det-SPAT1)      | 68.24  | 47.98    | 20.11      | 73.94
MSRC Full                         | 69.57  | 48.01    | 20.11      | 73.97

Phrase Type                       | vehicles | instruments | scene | other
GroundeR (VGG_cls) [99]           | 58.75    | 22.84       | 52.07 | 24.13
Wang et al. [113]                 | 52.25    | 23.46       | 34.22 | 26.23
CCA embedding [93]                | 68.75    | 37.65       | 51.39 | 31.77
SRN: MCB+Reg (VGG_det-SPAT1)      | 65.25    | 24.74       | 64.10 | 34.62
SRN: MNN+Reg (VGG_det-SPAT1)      | 72.44    | 29.34       | 63.68 | 37.88
CRN: MNN+Reg (VGG_det-SPAT1)      | 73.66    | 29.34       | 66.00 | 38.32
MSRC Full                         | 75.32    | 29.34       | 66.17 | 39.01

Table 4.2: Phrase grounding performance for the different phrase types defined in Flickr30K
Entities. Accuracy is in percentage.
Multimodal dimension m | 64    | 128   | 256   | 512   | 1024
MNN+Reg                | 51.21 | 55.99 | 54.59 | 55.31 | 55.97

Table 4.3: SRN MNN+Reg (VGG_det-SPAT1) model's performance (accuracy in %) under
different dimensions of the multimodal subspace (weight of regression loss λ = 1.0).
Weight of regression loss. During training, SRN's loss in Eq. 4.1 is the classification
loss plus a weighted regression loss. We test different weights, λ = 0.5, 1.0, 2.0, 4.0 and 10.0,
to combine the regression loss with the classification loss. The results are shown in Table 4.4.
From the results, we observe that the hyperparameter λ does not have a big influence on SRN
if L^s_cls and L^s_reg are in a similar range. When λ is large, SRN loses useful information
contained in the classification part, which helps SRN choose a good proposal to regress.
Thus, when λ = 10.0, we observe a decrease in performance.

Regression weight λ | 0.5   | 1.0   | 2.0   | 4.0   | 10.0
MNN+Reg             | 54.04 | 55.99 | 55.68 | 55.32 | 54.12

Table 4.4: SRN MNN+Reg (VGG_det-SPAT1) model's performance (accuracy in %) under
different weights of the regression loss in Eq. 4.1 (multimodal subspace dimension m = 128).
Approach                          | Accuracy (%)
Compared approaches:
SCRC [45]                         | 17.93
GroundeR (VGG_cls-SPAT2) [99]     | 26.93
Spatial regression models:
MCB+Reg (VGG_cls-SPAT2)           | 26.54
MNN+Reg (VGG_cls-SPAT2)           | 32.21

Table 4.5: Different models' performance on Refer-it Game. Since there is no context
information annotated, we only evaluate SRN models.
4.3.4 Performance on Refer-it Game
SRN only. Due to the structure of the Refer-it Game dataset, there is no context information
labeled, because there are no image captions for each image. Hence, we only evaluate
SRN's performance on the Refer-it Game dataset. In the testing stage, we choose the proposal
with the maximum probability predicted by SRN alone. This is equivalent to setting γ = 0 in
Eq. (4.9).
A comparison of different approaches on the Refer-it Game dataset is shown in Table 4.5.
From the results, we observe that MCB does not perform as well as MNN. This is
likely because MCB needs to maintain a large output dimension to exhibit the model's
expressive power, and Refer-it Game does not have as much data as Flickr30K Entities; thus,
the training procedure may overfit at an early stage. After using the regression network, SRN
with MCB (MCB+Reg) has comparable performance with GroundeR [99]. Using
MNN (MNN+Reg), SRN achieves the highest performance, with a 5.28% improvement
compared to the state-of-the-art method.
Dimension of multimodal subspace in SRN. Similar to Flickr30K Entities, we
train and test SRN (MNN+Reg) with four different multimodal subspace dimensions,
m = 64, 128, 256, 512. The performances are recorded in Table 4.6. From the results,
we observe that SRN shows some fluctuation in accuracy. Overall, it is insensitive to the
multimodal subspace dimension m, which is similar to Table 4.3.
Multimodal dimension m | 64    | 128   | 256   | 512
MNN+Reg                | 30.72 | 32.21 | 30.89 | 31.65

Table 4.6: SRN MNN+Reg (VGG_cls-SPAT2) model's performance (accuracy in %) under
different dimensions of the multimodal subspace on the Refer-it Game dataset. We fix the
weight of the regression loss λ = 1.0.

Regression weight λ | 0.5   | 1.0   | 2.0   | 4.0   | 10.0
MNN+Reg             | 31.65 | 32.21 | 31.56 | 31.69 | 32.04

Table 4.7: SRN MNN+Reg (VGG_cls-SPAT2) model's performance (accuracy in %) under
different coefficients of the regression loss in Eq. 4.1. We fix the multimodal subspace
dimension m = 128.
Weight of regression loss. We test different values of λ in Eq. (4.1). The results are
shown in Table 4.7. From the results, we find that when λ is small, SRN's performance
is low, because the classification part dominates training. When λ becomes large,
SRN's performance fluctuates and is best at λ = 1.0, which is similar to the results
in Table 4.4.
4.3.5 Qualitative results
We visualize some phrase grounding results on the Flickr30K Entities and Refer-it Game
datasets (Fig. 4.3). For Flickr30K Entities, we show an image and its caption as well as
the queries in the caption. For each query, we visualize the ground truth box, the proposal
box selected by the MSRC system, and the regressed bounding box based on the regression
parameters predicted by SRN. Since there is no context information in Refer-it Game,
we visualize the query and its ground truth box, with the selected proposal and regressed box
predicted only by SRN.
Figure 4.3: Some phrase grounding results generated by the MSRC system on the Flickr30K
Entities and Refer-it Game datasets. We visualize the ground truth bounding box, the
selected proposal box and the regressed bounding box in blue, green and red respectively.
The first three rows are phrase grounding results on the Flickr30K Entities dataset. The
first column is the input image and the query phrases coming from the same image caption.
The 2nd-4th columns correspond to different queries and their grounding results. The fourth
row contains grounding results on the Refer-it Game dataset. For different queries, the MSRC
system is able to localize objects in the same images. However, when a query is unclear
without further context information, the MSRC system may ground wrong objects (image in
row four, column four).
Chapter 5
Mid-Level Multimodal Reasoning for Phrase Grounding
(Part II)
5.1 Introduction
Given an image and a related textual description, phrase grounding attempts to localize
objects which are mentioned by corresponding phrases in the description. It is an impor-
tant building block in computer vision with natural language interaction, which can be
utilized in high-level tasks, such as image retrieval [6, 94], image captioning [1, 20] and
visual question answering [2, 11, 21].
Phrase grounding is a challenging problem that involves parsing language queries
and relating the knowledge to localize objects in the visual domain. To address this
problem, typically a proposal generation system is first applied to produce a set of proposals
as grounding candidates. The main difficulties lie in how to learn the correlation
between the language (query) and visual (proposals) modalities, and how to localize objects
based on multimodal correlation. State-of-the-art methods address the first difficulty by
learning a subspace to measure the similarities between proposals and queries. With the
learned subspace, they treat the second difficulty as a retrieval problem, where proposals
are ranked based on their relevance to the input query. Among these, the Phrase-Region
CCA [93] and SCRC [45] models learn a multimodal subspace via Canonical Correlation
Analysis (CCA) and a Recurrent Neural Network (RNN) respectively. Varun et al. [84]
learn multimodal correlation aided by context objects in visual content. GroundeR [99]
[Figure 5.1 illustration: for the caption "A man is playing a guitar for a little girl", the query is "A man" and the context is "a guitar, a little girl". Step 1: PGN generates a set of proposals as candidates for QRN. Step 2: QRN regresses each proposal and predicts its relevance to the query. Step 3: CPN rewards the top ranked proposals based on their relationship with context mentioned objects (foreground reward 1.0, context reward 0.2, background reward 0.0), and back propagates the rewards as policy gradients using reinforcement learning.]
Figure 5.1: QRC Net first regresses each proposal based on the query's semantics and visual
features, and then utilizes context information as rewards to refine grounding results.
introduces an attention mechanism that learns to attend on related proposals given
different queries through phrase reconstruction.
These approaches have two important limitations. First, proposals generated by
independent systems may not always cover all mentioned objects given various queries;
since retrieval based methods localize objects by choosing one of these proposals, they
are bounded by the performance limits of proposal generation systems. Second, even
though query phrases are often selected from image descriptions, context from these
descriptions is not utilized to reduce semantic ambiguity. Consider the example in Fig. 5.1.
Given the query "a man", the phrases "a guitar" and "a little girl" can be considered to
provide context: proposals overlapping with "a guitar" or "a little girl" are less likely
to be the ones containing "a man".
To address the aforementioned issues, we propose to predict the mentioned object's
location rather than selecting candidates from limited proposals. We adopt a regression based
method guided by the input query's semantics. To reduce semantic ambiguity, we assume
that different phrases in one sentence refer to different visual objects. Given one query
phrase, we evaluate predicted proposals and down-weight those which cover objects mentioned
by other phrases (i.e., context). For example, we assign lower rewards for proposals
containing "a guitar" and "a little girl" in Fig. 5.1 to guide the system to select more
discriminative proposals containing "a man". Since this procedure depends on prediction results
and is non-differentiable, we utilize reinforcement learning [107] to adaptively estimate
these rewards conditioned on context information and jointly optimize the framework.
In implementation, we propose a novel Query-guided Regression network with Context
policy (QRC Net) which consists of a Proposal Generation Network (PGN), a Query-
guided Regression Network (QRN) and a Context Policy Network (CPN). PGN is a
proposal generator which provides candidate proposals given an input image (red boxes
in Fig. 5.1). To overcome the performance limit of PGN, QRN not only estimates each
proposal's relevance to the input query, but also predicts its regression parameters to the
mentioned object conditioned on the query's intent (yellow and green boxes in Fig. 5.1).
CPN samples QRN's prediction results and evaluates them by leveraging context
information as a reward function. The estimated reward is then back propagated as policy
gradients (Step 3 in Fig. 5.1) to assist QRC Net's optimization. In the training stage, we
jointly optimize PGN, QRN and CPN using the alternating method in [97]. In the test stage,
we fix CPN and apply the trained PGN and QRN to ground objects for different queries.
We evaluate QRC Net on two grounding datasets: Flickr30K Entities [93] and Referit
Game [52]. Flickr30K Entities contains more than 30K images and 170K query phrases,
while Referit Game has 19K images referred to by 130K query phrases. Experiments show
QRC Net outperforms state-of-the-art methods by a large margin on both datasets,
with more than a 14% increase on Flickr30K Entities and a 17% increase on Referit Game
in accuracy.
Our contributions are twofold: First, we propose a query-guided regression network
to overcome performance limits of independent proposal generation systems. Second,
we introduce reinforcement learning to leverage context information to reduce semantic
ambiguity.
5.2 QRC Network
QRC Net is composed of three parts: a Proposal Generation Network (PGN) to generate
candidate proposals, a Query-guided Regression Network (QRN) to regress and rank these
Figure 5.2: Query-guided Regression network with Context policy (QRC Net) consists
of a Proposal Generation Network (PGN), a Query-guided Regression Network (QRN)
and a Context Policy Network (CPN). PGN generates proposals and extracts their CNN
features via a RoI pooling operation [97]. QRN encodes input query's semantics by an
LSTM [44] model and regresses proposals conditioned on the query. CPN samples the
top ranked proposals, and assigns rewards considering whether they are foreground (FG),
background (BG) or context. These rewards are back propagated as policy gradients to
guide QRC Net to select more discriminative proposals.
candidates and a Context Policy Network (CPN) to further leverage context information
to refine ranking results. In many instances, an image is described by a sentence that
contains multiple noun phrases which are used as grounding queries, one at a time. We
consider the phrases that are not in the query to provide context; specifically, we infer
that they refer to objects not referred to by the query. This helps rank proposals; we use
CPN to optimize the ranking using a reinforcement learning policy gradient algorithm.
We first present the framework of QRC Net, followed by the details of PGN, QRN and
CPN respectively. Finally, we illustrate how to jointly optimize QRC Net and employ
QRC Net in the phrase grounding task.
5.2.1 Framework
The goal of QRC Net is to localize the mentioned object's location $y$ given an image $x$ and
a query phrase $q$. To achieve this, PGN generates a set of $N$ proposals $\{r_i\}$ as candidates.
Given the query $q$, QRN predicts their regression parameters $\{t_i\}$ and probability $\{p_i\}$
of being relevant to the input query. To reduce semantic ambiguity, CPN evaluates
prediction results of QRN based on the locations of objects mentioned by context phrases,
and adopts a reward function $F$ to adaptively penalize high ranked proposals containing
context-mentioned objects. Reward calculation depends on predicted proposals, and this
procedure is non-differentiable. To overcome this, we deploy a reinforcement learning
procedure in CPN where this reward is back propagated as policy gradients [108] to
optimize QRN's parameters, which guides QRN to predict more discriminative proposals.
The objective for QRC Net is:
$$\arg\min_{\theta}\sum_{q}\left[L_{gen}(\{r_i\}) + L_{cls}(\{r_i\},\{p_i\},y) + \lambda L_{reg}(\{r_i\},\{t_i\},y) + J(\theta)\right] \qquad (5.1)$$
where $\theta$ denotes the QRC Net's parameters to be optimized and $\lambda$ is a hyperparameter.
$L_{gen}$ is the loss for the proposals generated by PGN. $L_{cls}$ is a multi-class classification
loss generated by QRN in predicting the probability $p_i$ of each proposal $r_i$. $L_{reg}$ is
a regression loss from QRN to regress each proposal $r_i$ to the mentioned object's location
$y$. $J(\theta)$ is the reward expectation calculated by CPN.
5.2.2 Proposal Generation Network (PGN)
We build PGN with a similar structure as that of the RPN in [97]. PGN adopts a fully
convolutional neural network (FCN) to encode the input image $x$ as an image feature
map $\mathbf{x}$. For each location (i.e., anchor) in the image feature map, PGN uses different scales
and aspect ratios to generate proposals $\{r_i\}$. Each anchor is fed into a multi-layer
perceptron (MLP) which predicts a probability $p^o_i$ estimating the objectness of the
anchor, and 4D regression parameters $t_i = [(x-x_a)/w_a,\ (y-y_a)/h_a,\ \log(w/w_a),\ \log(h/h_a)]$
as defined in [97]. The regression parameters $t_i$ estimate the offset from the anchor to the
mentioned objects' bounding boxes. Given all mentioned objects' locations $\{y_l\}$, we consider
a proposal to be positive when it covers some object $y_l$ with Intersection over Union (IoU)
$> 0.7$, and negative when IoU $< 0.3$. The generation loss is:
$$L_{gen}(\{r_i\}) = -\frac{1}{N_{cls}}\sum_{i=1}^{N_{cls}} \mathbb{1}(i \in S_y \cup \bar{S}_y)\log(p^o_i) + \frac{\lambda_g}{N_{reg}}\sum_{i=1}^{N_{reg}} \mathbb{1}(i \in S_y)\sum_{j=0}^{3} f\left(|t_i[j] - t^*_i[j]|\right) \qquad (5.2)$$
where $\mathbb{1}(\cdot)$ is an indicator function, $S_y$ is the set of positive proposals' indexes and $\bar{S}_y$
is the set of negative proposals' indexes. $N_{reg}$ is the number of all anchors and $N_{cls}$
is the number of sampled positive and negative anchors as defined in [97]. $t^*_i$ represents the
regression parameters of anchor $i$ to the corresponding object's location $y_l$. $f(\cdot)$ is the smooth
L1 loss function: $f(x) = 0.5x^2$ if $|x| < 1$, and $f(x) = |x| - 0.5$ if $|x| \geq 1$.
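For concreteness, the following is a minimal NumPy sketch of the smooth L1 function $f$ and of the 4D anchor-to-object regression targets used above; the function and variable names are ours, not part of the thesis.

```python
import numpy as np

def smooth_l1(x):
    # f(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5 (Sec. 5.2.2)
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a * a, a - 0.5)

def regression_targets(anchor, obj):
    # anchor, obj: [x, y, w, h] with (x, y) the box center
    xa, ya, wa, ha = anchor
    x, y, w, h = obj
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])
```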
We sample the top $N$ anchors based on $\{p^o_i\}$ and regress them as proposals $\{r_i\}$ with the
predicted regression parameters $t_i$. Through a RoI pooling operation [97], we extract a
visual feature $v_i \in \mathbb{R}^{d_v}$ for each proposal $r_i$. $\{r_i\}$ and $\{v_i\}$ are fed into QRN as visual
inputs.
5.2.3 Query-guided Regression Network (QRN)
For the input query $q$, QRN encodes its semantics as an embedding vector $q \in \mathbb{R}^{d_q}$ via a
Long Short-Term Memory (LSTM) model. Given the visual inputs $\{v_i\}$, QRN concatenates
the embedding vector $q$ with each proposal's visual feature $v_i$. It then applies
a fully-connected (fc) layer to generate multimodal features $\{v^q_i\} \in \mathbb{R}^m$ for each
$\langle q, r_i \rangle$ pair in an $m$-dimensional subspace. The multimodal feature $v^q_i$ is calculated as:
$$v^q_i = \varphi(W_m(q \,\|\, v_i) + b_m) \qquad (5.3)$$
where $W_m \in \mathbb{R}^{(d_q+d_v) \times m}$, $b_m \in \mathbb{R}^m$ are projection parameters, $\varphi(\cdot)$ is a non-linear
activation function, and "$\|$" denotes a concatenation operator.
Based on the multimodal feature $v^q_i$, QRN predicts a 5D vector $s^p_i \in \mathbb{R}^5$ via a fc layer
for each proposal $r_i$ (superscript "p" denotes prediction):
$$s^p_i = W_s v^q_i + b_s \qquad (5.4)$$
where $W_s \in \mathbb{R}^{m \times 5}$ and $b_s \in \mathbb{R}^5$ are the projection weight and bias to be optimized. The first
element of $s^p_i$ estimates the confidence of $r_i$ being related to the input query $q$'s semantics. The
next four elements are regression parameters in the same form as $t_i$ defined in
Sec. 5.2.2, where $x, y, w, h$ are replaced by the regressed values and $x_a, y_a, w_a, h_a$ are the proposal's
parameters.
We denote $\{p_i\}$ as the probability distribution over $\{r_i\}$ after we feed $\{s^p_i[0]\}$ to a softmax
function. Same as [99], we consider as positive the proposal which overlaps most with the
ground truth and has IoU $> 0.5$. Thus, the classification loss is calculated as:
$$L_{cls}(\{r_i\},\{p_i\},y) = -\log(p_{i^*}) \qquad (5.5)$$
where $i^*$ is the positive proposal's index in the proposal set.
Given the object's location $y$ mentioned by query $q$, each proposal's ground truth
regression data $s^q_i \in \mathbb{R}^4$ is calculated in the same way as the last four elements of $s^p_i$,
by replacing $[x, y, w, h]$ with the ground truth bounding box's location information. The
regression loss for QRN is:
$$L_{reg}(\{t_i\},\{r_i\},y) = \frac{1}{4N}\sum_{i=1}^{N}\sum_{j=0}^{3} f\left(|s^p_i[j+1] - s^q_i[j]|\right) \qquad (5.6)$$
where $f(\cdot)$ is the smooth L1 function defined in Sec. 5.2.2.
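As an illustration, here is a minimal PyTorch-style sketch of the QRN computation in Eqs. 5.3-5.4 (concatenation, an fc layer with ReLU as $\varphi$, and a second fc layer producing the 5D output). The class and variable names are ours, and the dimensions follow the defaults reported in Sec. 5.3.2; this is not the thesis' reference implementation.

```python
import torch
import torch.nn as nn

class QRN(nn.Module):
    # Sketch of Eqs. 5.3-5.4 under the assumptions stated above.
    def __init__(self, d_q=1000, d_v=4101, m=512):
        super().__init__()
        self.fc_m = nn.Linear(d_q + d_v, m)  # W_m, b_m (Eq. 5.3)
        self.fc_s = nn.Linear(m, 5)          # W_s, b_s (Eq. 5.4)

    def forward(self, q, v):
        # q: (d_q,) query embedding from the LSTM; v: (N, d_v) proposal features
        q = q.unsqueeze(0).expand(v.size(0), -1)
        v_q = torch.relu(self.fc_m(torch.cat([q, v], dim=1)))  # Eq. 5.3
        s_p = self.fc_s(v_q)                                   # Eq. 5.4
        p = torch.softmax(s_p[:, 0], dim=0)   # relevance distribution {p_i}
        t = s_p[:, 1:]                        # regression parameters
        return p, t
```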
5.2.4 Context Policy Network (CPN)
Besides using QRN to predict and regress proposals, we further apply a CPN to guide
QRN to avoid selecting proposals which cover the objects referred to by query $q$'s context
in the same description. CPN evaluates and assigns rewards for the top ranked proposals
produced by QRN, and, since this procedure is non-differentiable, applies a policy gradient [108] to update
QRN's parameters.
Specifically, proposals $\{r_i\}$ from QRN are first ranked based on their probability
distribution $\{p_i\}$. Given the ranked proposals, CPN selects the top $K$ proposals $\{r'_i\}$ and
evaluates them by assigning rewards. This procedure is non-differentiable, since we do
not know the proposals' qualities until they are ranked based on QRN's probabilities.
Therefore, we use policy gradient reinforcement learning to update QRN's
parameters. The goal is to maximize the expectation of the predicted reward $F(\{r'_i\})$ under the
distribution of $\{r'_i\}$ parameterized by the QRN, i.e., $J = \mathbb{E}_{\{p_i\}}[F]$. According to the
algorithm in [115], the policy gradient is
$$\nabla_{\theta_r} J = \mathbb{E}_{\{p_i\}}\left[F(\{r'_i\})\,\nabla_{\theta_r}\log p'_i(\theta_r)\right] \qquad (5.7)$$
where $\theta_r$ are QRN's parameters and $\nabla_{\theta_r}\log p'_i(\theta_r)$ is the gradient produced by QRN for the
top ranked proposal $r'_i$.
To predict the reward value $F(\{r'_i\})$, CPN averages the top ranked proposals' visual features
$\{v'_i\}$ as $v_c$. The predicted reward is computed as:
$$F(\{r'_i\}) = \sigma(W_c(v_c \,\|\, q) + b_c) \qquad (5.8)$$
where "$\|$" denotes the concatenation operation and $\sigma(\cdot)$ is a sigmoid function. $W_c$ and $b_c$
are projection parameters which produce a scalar value as the reward.
To train CPN, we design a reward function to guide CPN's prediction. The reward
function acts as feedback from the environment and guides CPN to produce meaningful
policy gradients. Intuitively, to help QRN select more discriminative proposals related
to query $q$ rather than context, we assign a lower reward to a top ranked proposal
that overlaps an object mentioned by context and a higher reward if it overlaps with the
object mentioned by the query. Therefore, we design the reward function as:
$$R(\{r'_i\}) = \frac{1}{K}\sum_{i=1}^{K}\left[\mathbb{1}(r'_i \in S_q) + \beta\,\mathbb{1}\big(r'_i \notin (S_q \cup S_{bg})\big)\right] \qquad (5.9)$$
where $S_q$ is the set of proposals with IoU $> 0.5$ with the objects mentioned by query $q$, and
$S_{bg}$ is the set of background proposals with IoU $< 0.5$ with the objects mentioned by all
queries in the description. $\mathbb{1}(\cdot)$ is an indicator function and $\beta \in (0,1)$ is the reward for
proposals overlapping with objects mentioned by context. The reward prediction loss is:
$$L_{rwd}(\{r'_i\}) = \left\|F(\{r'_i\}) - R(\{r'_i\})\right\|^2 \qquad (5.10)$$
During training, $L_{rwd}$ is backpropagated only to CPN for optimization, while CPN
backpropagates policy gradients (Eq. 5.7) to optimize QRN.
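To make the training signal concrete, the following is a minimal Python sketch of the reward in Eq. 5.9 and of a REINFORCE-style surrogate loss whose gradient matches Eq. 5.7. The function names are ours, the context reward $\beta$ is exposed as an argument, and the membership sets are assumed to be precomputed from IoU.

```python
import torch

def context_reward(top_proposals, S_q, S_bg, beta=0.2):
    # Eq. 5.9: 1 for proposals covering the queried object, beta for those
    # covering a context-mentioned object, 0 for background proposals.
    r = 0.0
    for p in top_proposals:
        if p in S_q:
            r += 1.0
        elif p not in S_bg:
            r += beta
    return r / len(top_proposals)

def policy_gradient_loss(log_probs_topk, reward):
    # Minimizing -reward * sum(log p'_i) gives the gradient form of Eq. 5.7.
    return -(reward * log_probs_topk).sum()
```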
5.2.5 Training and Inference
We train PGN based on an RPN pre-trained on the PASCAL VOC 2007 [19] dataset, and
adopt the alternating training method in [97] to optimize PGN. We first train PGN and
use its proposals to train QRN and CPN, then initialize PGN with the model tuned by QRN and CPN's
training, and iterate this once. Same as [99], we select 100 proposals produced by
PGN ($N = 100$) and select the top 10 proposals ($K = 10$) predicted by QRN to assign
rewards in Eq. 5.9. After calculating the policy gradient in Eq. 5.7, we jointly optimize QRC
Net's objective (Eq. 5.1) using the Adam algorithm [54]. We choose the rectified linear unit
(ReLU) as the non-linear activation function $\varphi$.
During the testing stage, CPN is fixed and we stop its reward calculation. Given an image,
PGN is first applied to generate proposals and their visual features. QRN regresses these
proposals and predicts the relevance of each proposal to the query. The regressed proposal
with the highest relevance is selected as the prediction result.
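A sketch of this test-time grounding step is given below; the offset decoding assumes the RPN-style parameterization of Sec. 5.2.2, and all names are ours.

```python
import torch

def qrc_inference(p, t, proposals):
    # p: (N,) relevance scores from QRN; t: (N, 4) predicted offsets;
    # proposals: (N, 4) boxes as [x, y, w, h] with (x, y) the box center.
    j = int(torch.argmax(p))
    xa, ya, wa, ha = proposals[j]
    dx, dy, dw, dh = t[j]
    x, y = xa + dx * wa, ya + dy * ha
    w, h = wa * torch.exp(dw), ha * torch.exp(dh)
    return torch.stack([x, y, w, h])
```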
5.3 Experiment
We evaluate QRC Net on the Flickr30K Entities [93] and Referit Game [52] datasets for the
phrase grounding task.
5.3.1 Datasets
Flickr30K Entities [93]: The numbers of training, validation and testing images are
29783, 1000, 1000 respectively. Each image is associated with 5 captions, with 3.52 query
phrases in each caption on average. There are 276K manually annotated bounding boxes
referred by 360K query phrases in images. The vocabulary size for all these queries is
17150.
Referit Game [52] consists of 19,894 images of natural scenes. There are 96,654
distinct objects in these images. Each object is referred to by 1-3 query phrases (130,525
in total). There are 8800 unique words among all the phrases, with a maximum length
of 19 words.
5.3.2 Experiment Setup
Proposal generation. We adopt a PGN (Sec. 5.2.2) to generate proposals. During
training, we optimize PGN based on an RPN pre-trained on the PASCAL VOC 2007
dataset [19], which does not overlap with Flickr30K Entities [93] or Referit Game [52].
We also evaluate QRC Net based on Selective Search [109] (denoted as "SS") and
EdgeBoxes [129] (denoted as "EB"), and an RPN [97] pre-trained on PASCAL VOC 2007 [91]
(denoted as "RPN"), which are all independent of QRN and CPN.
Visual feature representation. For QRN, the visual features are directly generated
from PGN via a RoI pooling operation. Since PGN contains a VGG Network [105] to
process images, we denote these features as "VGG$_{pgn}$". To predict regression parameters, we
need to include spatial information for each proposal. For Flickr30K Entities, we augment
each proposal's visual feature with its spatial information
$[x_{tl}/W,\ y_{tl}/H,\ x_{br}/H,\ y_{br}/W,\ wh/WH]$ as defined in [122]. These augmented features are
4101D vectors ($d_v = 4101$).
For Referit Game, we augment VGG$_{pgn}$ with each proposal's spatial information
$[x_{min}, y_{min}, x_{max}, y_{max}, x_{center}, y_{center}, w_{box}, h_{box}]$, which is the same as [99] for fair comparison. We denote
these features as "VGG$_{pgn}$-SPAT", which are 4104D vectors ($d_v = 4104$).
To compare with other approaches, we replace PGN with a Selective Search and an
EdgeBoxes proposal generator. Same as [99], we choose a VGG network finetuned using
Fast-RCNN [33] on PASCAL VOC 2007 [19] to extract visual features for Flickr30K
Entities. We denote these features as "VGG$_{det}$". Besides, we follow [99] and apply a VGG
network pre-trained on ImageNet [17] to extract proposals' features for Flickr30K Entities
and Referit Game, which are denoted as "VGG$_{cls}$". We augment VGG$_{det}$ and VGG$_{cls}$
with spatial information for the Flickr30K Entities and Referit Game datasets following the
method mentioned above.
Model initialization. Following the same settings as in [99], we encode queries via
an LSTM model, and choose the last hidden state from the LSTM as $q$ (dimension $d_q = 1000$).
All convolutional layers are initialized by the MSRA method [41] and all fc layers
are initialized by the Xavier method [34]. We introduce batch normalization layers after
projecting visual and language features (Eq. 5.3).
During training, the batch size is 40. We set the weight of the regression loss $L_{reg}$ to $\lambda = 1.0$
(Eq. 5.1), and the reward value to $\beta = 0.2$ (Eq. 5.9). The dimension of the multimodal feature vector
$v^q_i$ is set to $m = 512$ (Eq. 5.3). Analysis of hyperparameters is provided in Sec. 5.3.3
and 5.3.4.
Metric. Same as [99], we adopt accuracy as the evaluation metric, defined to be the
ratio of phrases for which the regressed box overlaps with the mentioned object by more
than 50% IoU.
Compared approaches. We choose GroundeR [99], CCA embedding [93], MCB [21],
Structured Matching [113] and SCRC [45] for comparison, which all achieve leading
performances in phrase grounding. For GroundeR [99], we compare with its supervised
learning scenario, which achieves the best performance among its different scenarios.
5.3.3 Performance on Flickr30K Entities
Comparison in accuracy. We first evaluate QRN performance based on different
independent proposal generation systems. As shown in Table 5.1, by adopting QRN,
RPN+QRN achieves a 14.35% increase compared to RPN+GroundeR. We further improve
QRN's performance by adopting a Selective Search (SS) proposal generator. Compared
to SS+GroundeR, we achieve an 8.18% increase in accuracy. We then incorporate our own
PGN into the framework, which is jointly optimized to generate proposals as well as
features (VGG$_{pgn}$). By adopting PGN, PGN+QRN achieves a 4.22% increase compared
Approach Accuracy (%)
Compared approaches
SCRC [45] 27.80
Structured Matching [113] 42.08
SS+GroundeR (VGG$_{cls}$) [99] 41.56
RPN+GroundeR (VGG$_{det}$) [99] 39.13
SS+GroundeR (VGG$_{det}$) [99] 47.81
MCB [21] 48.69
CCA embedding [93] 50.89
Our approaches
RPN+QRN (VGG$_{det}$) 53.48
SS+QRN (VGG$_{det}$) 55.99
PGN+QRN (VGG$_{pgn}$) 60.21
QRC Net (VGG$_{pgn}$) 65.14
Table 5.1: Different models' performance on Flickr30K Entities. Our framework is evaluated by combining with various proposal generation systems.
Proposal generation RPN [97] SS [109] PGN
UBP (%) 71.25 77.90 89.61
BPG 7.29 3.62 7.53
Table 5.2: Comparison of different proposal generation systems on Flickr30K Entities.
to the independent proposal generation system (SS+QRN) in accuracy. Finally, we include
CPN to guide QRN in selecting more discriminative proposals during training. The full
model (QRC Net) achieves a 4.93% increase compared to PGN+QRN, and a 14.25% increase
over the state-of-the-art CCA embedding [93] in accuracy.
Detailed comparison. Table 5.7 provides the detailed phrase localization results
based on the phrase type information for each query in Flickr30K Entities. We can
observe that QRC Net provides consistently superior results. The CCA embedding [93] model
is good at localizing "instruments" while GroundeR [99] is strong in localizing "scene".
By using QRN, we observe that the regression network achieves a consistent increase in
accuracy compared to the GroundeR model (VGG$_{det}$) in all phrase types except for the class
"instruments". In particular, there is a large increase in performance of localizing "animals"
(an increase of 11.39%). By using PGN, we observe that PGN+QRN surpasses the
state-of-the-art method in all classes, with the largest increase in the class "instruments". Finally,
Weight $\lambda$ 0.5 1.0 2.0 4.0 10.0
Accuracy (%) 64.15 65.14 64.40 64.29 63.27
Table 5.3: QRC Net's performance on Flickr30K Entities for different weights $\lambda$ of $L_{reg}$.
Dimension m 128 256 512 1024
Accuracy (%) 64.08 64.59 65.14 62.52
Table 5.4: QRC Net's performance on Flickr30K Entities for different dimensions $m$ of $v^q_i$.
by applying CPN, QRC Net achieves more than 8.03%, 9.37% and 8.94% increases in
accuracy across all categories compared to CCA embedding [93], Structured Matching [113] and
GroundeR [99] respectively. QRC Net achieves a maximum increase in performance
of 15.73% over CCA embedding [93] ("scene"), 32.90% over Structured Matching [113]
("scene") and 21.46% over GroundeR [99] ("clothing").
Proposal generation comparison. We observe that proposal quality plays an important
role in final grounding performance. The influence has two aspects. First is the
Upper Bound Performance (UBP), defined as the ratio of ground truth objects covered by the
generated proposals among all ground truth objects. Without a regression mechanism, UBP
directly determines the performance limit of grounding systems. Another aspect is the
average number of surrounding Bounding boxes Per Ground truth object (BPG). Generally,
when BPG increases, more candidates are considered as positive, which reduces the
difficulty for the following grounding system. To evaluate UBP and BPG, we consider that a
proposal covers the ground truth object when its IoU $> 0.5$. The statistics for RPN, SS
and PGN in these two aspects are provided in Table 5.2. We observe that PGN achieves an
increase in both UBP and BPG, which indicates PGN provides high quality proposals
for QRN and CPN. Moreover, since QRN adopts a regression-based method, it can surpass
the UBP of PGN, which further relieves the influence of the UBP of proposal generation
systems.
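For clarity, the sketch below computes the two statistics under the stated IoU > 0.5 criterion; `iou` stands for any standard box-IoU function and all names are ours.

```python
def ubp_bpg(proposals_per_image, gts_per_image, iou):
    # UBP: fraction of ground-truth boxes covered by at least one proposal.
    # BPG: average number of proposals covering each ground-truth box.
    covered, total_gt, total_boxes = 0, 0, 0
    for props, gts in zip(proposals_per_image, gts_per_image):
        for g in gts:
            hits = sum(1 for p in props if iou(p, g) > 0.5)
            covered += int(hits > 0)
            total_boxes += hits
            total_gt += 1
    return covered / total_gt, total_boxes / total_gt
```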
Hyperparameters. We evaluate QRC Net for different sets of hyperparameters. To
evaluate one hyperparameter, we fix the other hyperparameters to the default values in Sec. 5.3.2.
Reward $\beta$ 0.1 0.2 0.4 0.8
Accuracy (%) 64.10 65.14 63.88 62.77
Table 5.5: QRC Net's performance on Flickr30K Entities for different reward values $\beta$ of CPN.
Approach Accuracy (%)
Compared approaches
SCRC [45] 17.93
EB+GroundeR (VGG$_{cls}$-SPAT) [99] 26.93
Our approaches
EB+QRN (VGG$_{cls}$-SPAT) 32.21
PGN+QRN (VGG$_{pgn}$-SPAT) 43.57
QRC Net (VGG$_{pgn}$-SPAT) 44.07
Table 5.6: Different models' performance on the Referit Game dataset.
We first evaluate QRC Net's performance for different regression loss weights $\lambda$. The
results are shown in Table 5.3. We observe the performance of QRC Net fluctuates when
$\lambda$ is small and decreases when $\lambda$ becomes large.
We then evaluate QRC Net's performance for different dimensions $m$ of the multimodal
features in Eq. 5.3. The performances are presented in Table 5.4. We observe QRC Net's
performance fluctuates when $m < 1000$. When $m$ becomes large, the performance of QRC
Net decreases. These changes are small in scale, which shows the insensitivity
of QRC Net to these hyperparameters.
Finally, we evaluate different reward values $\beta$ for proposals covering objects mentioned
by context. We observe QRC Net's performance fluctuates when $\beta < 0.5$. When $\beta$ is close
to 1.0, the CPN assigns almost the same rewards for proposals covering ground truth objects
and context mentioned objects, which confuses the QRN. As a result, the performance of
QRC Net decreases.
5.3.4 Performance on Referit Game
Comparison in accuracy. To evaluate QRN's effectiveness, we first adopt an independent
EdgeBoxes [129] (EB) proposal generator, which is the same as [99]. As shown in
Table 5.6, by applying QRN, we achieve a 5.28% improvement compared to the EB+GroundeR
Figure 5.3: Some phrase grounding results in Flickr30K Entities [93] (first two rows) and
Referit Game [52] (third row). We visualize the ground truth bounding box, selected proposal
box and regressed bounding box in blue, green and red respectively. When the query is not
clear without further context information, QRC Net may ground wrong objects (e.g.,
the image in row three, column four).
model. We further incorporate PGN into the framework. The PGN+QRN model brings an
11.36% increase in accuracy, which shows the high quality of proposals produced by
PGN. Finally, we evaluate the full QRC Net model. Since the Referit Game dataset only
contains independent query phrases, there is no context information available. In this
case, only the first term in Eq. 5.9 guides the learning. Thus, CPN does not contribute
much to performance (a 0.50% increase in accuracy).
Hyperparameters. We evaluate QRC Net's performance for different hyperparameters
on the Referit Game dataset. First, we evaluate QRC Net's performance for different
Phrase Type people clothing body parts animals
GroundeR (VGG$_{cls}$) [99] 53.80 34.04 7.27 49.23
GroundeR (VGG$_{det}$) [99] 61.00 38.12 10.33 62.55
Structured Matching [113] 57.89 34.61 15.87 55.98
CCA embedding [93] 64.73 46.88 17.21 65.83
SS+QRN 68.24 47.98 20.11 73.94
PGN+QRN 75.08 55.90 20.27 73.36
QRC Net 76.32 59.58 25.24 80.50
Phrase Type vehicles instruments scene other
GroundeR (VGG$_{cls}$) [99] 58.75 22.84 52.07 24.13
GroundeR (VGG$_{det}$) [99] 68.75 36.42 58.18 29.08
Structured Matching [113] 52.25 23.46 34.22 26.23
CCA embedding [93] 68.75 37.65 51.39 31.77
SS+QRN 73.66 29.34 66.00 38.32
PGN+QRN 68.95 45.68 65.27 38.80
QRC Net 78.25 50.62 67.12 43.60
Table 5.7: Phrase grounding performance for different phrase types defined in Flickr30K Entities. Accuracy is in percentage.
Weight $\lambda$ 0.5 1.0 2.0 4.0 10.0
Accuracy (%) 43.71 44.07 43.61 43.60 42.75
Table 5.8: QRC Net's performance on Referit Game for different weights $\lambda$ of $L_{reg}$.
weights $\lambda$ of the regression loss $L_{reg}$. As shown in Table 5.8, the performance of QRC Net
fluctuates when $\lambda$ is small. When $\lambda$ becomes large, the regression loss outweighs the classification
loss, in which case a wrong seed proposal may be selected, producing wrong grounding results.
Thus, the performance decreases.
We then evaluate QRC Net's performance for different multimodal dimensions $m$ of $v^q_i$
in Eq. 5.3. In Table 5.9, we observe the performance changes on a small scale when $m < 1000$,
and decreases when $m > 1000$.
5.3.5 Qualitative Results
We visualize some phrase grounding results on Flickr30K Entities and Referit Game for
qualitative evaluation (Fig. 5.3). For Flickr30K Entities, we show an image with its
Dimension m 128 256 512 1024
Accuracy (%) 42.95 43.80 44.07 43.51
Table 5.9: QRC Net's performance on Referit Game for different dimensions $m$ of $v^q_i$.
associated caption, and highlight the query phrases in it. For each query, we visualize
the ground truth box, the proposal box selected by QRN and the regressed bounding
box based on the regression parameters predicted by QRN. Since there is no context
information in Referit Game, we visualize the query and ground truth box, with the selected
proposal and regressed box predicted by QRN.
As shown in Fig. 5.3, QRC Net is strong in recognizing different people ("A young
tennis player" in the first row) and clothes ("purple beanie" in the second row), which is
also validated in Table 5.7. However, when the query is ambiguous without further context
description, QRC Net may be confused and produce a plausible but incorrect grounding result
(e.g., "hat on the right" in the third row of Fig. 5.3).
Chapter 6
Mid-Level Multimodal Reasoning for Phrase Grounding
(Part III)
6.1 Introduction
Given an image and a natural language query, phrase grounding aims to localize objects
mentioned by the query. It is a fundamental building block for many high-level computer
vision tasks such as image retrieval [6], image QA [11, 23, 24] and video QA [28, 29].
Traditionally, training a good phrase grounding system requires large amounts of manual
annotations indicating the mapping between input queries and mentioned objects in
images; these are time-consuming to acquire and suffer from potential human errors. This
motivates us to address the problem of training a grounding system with weakly supervised
training data, where objects of interest are mentioned in language queries but are not
delineated in images.
Phrase grounding is difficult as both visual and language modalities are ambiguous
and we need to reason about both to find their correspondences. To address this
problem, typically a proposal generation system is applied to the input image to produce a
set of candidate regions (i.e., proposals). The phrase grounding task is then treated as a
retrieval problem to search for the most query-related proposals. Based on this, attention
mechanisms [8, 10, 99, 117] are learned to adaptively attend to mentioned objects for
input queries.
Training a phrase grounding system with weakly supervised data brings an additional
challenge as no direct mappings between the two modalities are provided. Consider
Figure 6.1: (a) supervised grounding systems, (b) state-of-the-art weakly supervised
grounding systems guided by language consistency, (c) KAC Net applies both visual and
language consistency and leverages complementary knowledge from the visual feature
extractor to facilitate weakly supervised grounding.
Fig. 6.1(c) where we encode the query as an embedding vector and extract visual features
for a set of object proposals from the image. To find correct mappings between the
query and the proposals, [99] proposes to associate the query with successive proposals;
once a proposal is selected, a phrase is reconstructed from it and evaluated for language
consistency with the input query. [117] adopts continuous attention maps and explores
reconstructing the structure of the input query as well as its context.
We introduce two new concepts to overcome the challenges of weakly supervised training.
First is that pre-trained, fixed category detectors can provide useful knowledge in selecting
the proposals that should be attended to. Second is that the detector knowledge enables
us to evaluate visual consistency, in addition to language consistency. This knowledge
also helps improve language consistency analysis.
We observe that if a pre-trained Convolutional Neural Network (CNN) (e.g., VGG [105])
is applied to extract visual features for proposals, it can also naturally produce a probability
distribution over the categories of the proposals, as this is the task that the network
was trained on (e.g., MSCOCO [66] classification). This free distribution can be treated
as complementary external knowledge to filter out, or downweight, proposals that are
unrelated to the query. For example, in Fig. 6.1(c), given the query "a man playing football",
a pre-trained VGG network can provide useful hints for candidate proposals by
predicting whether a proposal corresponds to a high probability "people" detection.
Use of external knowledge in language consistency is straightforward; features for
reconstruction can be modified by the detection probabilities. The task of evaluating visual
consistency is more difficult; a direct analogy to language consistency would be to convert
visual proposals to words and reconstruct image patches. Instead, we propose to predict
object locations from query and visual features to match the goal of phrase grounding.
This process would not be possible without the aid of external knowledge that helps focus
on the possibly related proposals for prediction.
In implementation, we construct a novel Knowledge Aided Consistency Network (KAC
Net) which consists of two branches: a visual consistency branch and a language con-
sistency branch. These two branches are joined by a shared multimodal subspace where
the attention model is applied. To leverage complementary knowledge from visual fea-
ture extractor, we propose a novel Knowledge Based Pooling (KBP) gate to focus on
query-related proposals for visual and language reconstruction.
We evaluate KAC Net on two grounding datasets: Flickr30K Entities [93] and Referit
Game [52]. Flickr30K Entities contains more than 30K images and 170K query phrases,
while Referit Game has 19K images referred to by 130K query phrases. We ignore bounding
box annotations during training in the weakly supervised scenario. Experiments show
KAC Net outperforms state-of-the-art methods by a large margin on both datasets,
with more than a 9% increase on Flickr30K Entities and a 5% increase on Referit Game in
accuracy.
Our contributions are twofold: First, we leverage complementary knowledge to filter
out unrelated proposals and provide direct guidance. Second, we propose a visual
consistency branch to boost grounding performance.
Knowledge transfer is a technique widely used for tasks in different domains. Hinton
et al. [43] propose to compress the knowledge learned by a model that is too computationally
expensive to deploy into a smaller one. Inspired by this, Aytar et al. [3] apply
visual knowledge to train a sound classification network. Owens et al. [89] use ambient
sound information to train an object detection network. Lin et al. [68] leverage knowledge
learned in the Visual Question Answering (VQA) task for image retrieval. Zhang et al. [125]
apply knowledge learned in image captioning and VQA to train a network detecting visual
Figure 6.2: The Knowledge Aided Consistency Network (KAC Net) consists of a visual
consistency branch and a language consistency branch. The visual consistency branch aims at
predicting and aligning query-related proposals' location parameters conditioned on the
input query. The language consistency branch attempts to reconstruct the input query from
query-related proposals. To provide guidance in training and testing, a Knowledge Based
Pooling (KBP) gate is applied to filter out unrelated proposals for both branches.
relations in images. For phrase grounding, we propose to leverage knowledge learned from a
pre-trained deep neural network to filter out unrelated proposals for visual consistency.
6.2 KAC Network
KAC Net consists of two branches: a visual consistency branch and a language consistency
branch, which reconstruct visual and language information respectively. The two
branches are joined in a shared multimodal subspace, where an attention model is applied
to attend on mentioned objects based on the query's semantics. To leverage external
knowledge from the pre-trained CNN feature extractor, a Knowledge Based Pooling (KBP)
gate is proposed to select query-related proposals. KAC Net is trained end-to-end, with
both visual and language consistency restrictions to guide the training.
We first introduce the framework of KAC Net, followed by the details of the KBP gate.
Then we illustrate how KBP is applied to facilitate the optimization of the visual and language
consistency branches. Finally, more details of training and inference are provided.
6.2.1 Framework
The goal of KAC Net is to localize the mentioned object $y$ given a query phrase $q$ and
an image $x$. To address the problem, a set of $N$ proposals $\{r_i\}$ is generated via an
object proposal generation system. An attention model is then applied to attend on the
proposal $r_q$ which contains the mentioned object $y$ based on the semantics of query $q$.
In the weakly supervised scenario, the mapping between query $q$ and the location of the
mentioned object $y$ is not provided. To learn the attention model, we adopt visual and
language consistency and construct two branches respectively. For language consistency,
a reconstruction model is applied to reconstruct the input query $q$ given the query-related
proposals predicted by the attention model. According to the language consistency, the
reconstructed query should be consistent with the input. A language consistency loss $L_{lc}$
is generated by comparing the reconstructed and original queries.
For visual consistency, we propose to reconstruct visual information for query-related
proposals. Since the goal of phrase grounding is to predict the mentioned object's location,
we choose to predict candidate proposals' location parameters conditioned on the input
query. Similar to language consistency, visual consistency requires that the predicted
parameters should recover each proposal's location. Based on this, a visual consistency
loss $L_{vc}$ is produced by calculating the difference between the predicted and original
proposals' location parameters.
To leverage rich image features and available fixed category classifiers, we apply KBP
to encode knowledge provided by the CNN and weight each proposal's importance in visual
and language consistency. The objective of KAC Net can be written as
$$\arg\min_{\theta}\sum_{q}\left(L^k_{lc} + \alpha L^k_{vc}\right) + \beta L_{reg} \qquad (6.1)$$
where $\theta$ denotes the parameters to be optimized. $L^k_{lc}$ is the reconstruction loss from the
language consistency branch and $L^k_{vc}$ is the reconstruction loss from the visual consistency
branch (superscript "k" refers to KBP). $L_{reg}$ is a weight regularization term. $\alpha$, $\beta$ are
hyperparameters.
Figure 6.3: A pre-trained CNN always predicts a probability distribution for its own
task. We take the most probable category predicted by the CNN and calculate its word
similarity to the noun words in the query as the knowledge $k^q_i$.
6.2.2 Knowledge Based Pooling (KBP)
We apply a pre-trained CNN to extract a visual feature $v_i$ for a proposal $r_i$, and predict
a probability distribution $p_i$ for its own task, which provides useful cues to filter out
unrelated proposals.
To encode this knowledge, we first parse the language query and retrieve all the noun
words via a Natural Language Processing (NLP) parser. For each proposal's distribution
$p_i$, we select the most probable class, i.e., the one with the highest probability. The knowledge $k^q_i$ for
proposal $r_i$ is then calculated as the word similarity between the name of this class and the
noun words in the query (Fig. 6.3). If a query contains multiple noun words, we average
all the calculated similarities as the knowledge $k^q_i$, which can be written as
$$k^q_i = \frac{1}{N_q}\sum_{j=1}^{N_q} \mathrm{sim}(C_i, w^q_j) \qquad (6.2)$$
where $C_i$ is the predicted class name for proposal $r_i$, and $w^q_j$ is the $j$-th of the $N_q$
noun words in the query $q$. $\mathrm{sim}$ is a function measuring the similarity between two words.
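A minimal sketch of this computation is shown below, assuming a word-embedding lookup table (e.g., from word2vec) is available; the function and variable names are ours.

```python
import numpy as np

def kbp_knowledge(class_name, noun_words, word_vec):
    # Eq. 6.2: average cosine similarity between the CNN's top predicted
    # class name and the noun words of the query.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    c = word_vec[class_name]
    sims = [cos(c, word_vec[w]) for w in noun_words]
    return sum(sims) / len(sims) if sims else 0.0
```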
In the training stage, the knowledge $k^q_i$ functions as a "pooling" gate which helps the visual
(Sec. 6.2.3) and language (Sec. 6.2.4) consistency branches select and reconstruct reliable
candidate proposals. In the test stage, the knowledge $k^q_i$ filters out unrelated proposals and
increases the chance of finding the proposal containing the mentioned object (Sec. 6.2.5).
6.2.3 Visual Consistency
The goal of visual consistency is to optimize the attention model via learning to predict
the location information contained in query-related proposals. Through predicting location
information conditioned on the input query, we expect to learn a better correlation
between the language and visual modalities. In the weakly supervised scenario, no annotations are
available to indicate the identity of the query-related proposal. Instead, we use KBP's
knowledge $k^q_i$ to provide guidance during training. We expect that the knowledge $k^q_i$ provides a
higher score when a proposal $r_i$ is query related. Thus, KBP can be applied to adaptively
weight each proposal's visual consistency loss conditioned on query $q$.
In implementation, we first apply a Long Short-Term Memory (LSTM) [44] model to
encode the input query $q$ into an embedding vector $q \in \mathbb{R}^{d_q}$. A pre-trained CNN is employed
to extract a visual feature $v_i \in \mathbb{R}^{d_v}$ for each proposal $r_i$, and a global visual feature $v \in \mathbb{R}^{d_v}$
for the input image $x$. The attention model then concatenates the embedding vector $q$ and the
image global feature $v$ with each proposal's feature $v_i$ and projects them into an
$m$-dimensional subspace. A multimodal feature $v^q_i$ is calculated as
$$v^q_i = \varphi(W_m(q \,\|\, v \,\|\, v_i) + b_m) \qquad (6.3)$$
where $W_m \in \mathbb{R}^{m \times (d_q + 2d_v)}$, $b_m \in \mathbb{R}^m$ are projection parameters, $\varphi(\cdot)$ is a non-linear
activation function, and "$\|$" denotes a concatenation operator.
After projecting into the multimodal subspace, the attention model predicts a 5D
vector $s^p_i \in \mathbb{R}^5$ via a fully connected (fc) layer (superscript "p" denotes prediction):
$$s^p_i = W_s v^q_i + b_s \qquad (6.4)$$
where $W_s \in \mathbb{R}^{5 \times m}$ and $b_s \in \mathbb{R}^5$ are projection parameters. The first element of $s^p_i$
estimates the confidence of $r_i$ being relevant to the input query $q$, and the next four elements
represent the predicted location parameters for each proposal.
We compare the predicted location parameters with the original proposal's parameters
$t_i \in \mathbb{R}^4$ and calculate the regression loss
$$d_i = \frac{1}{4}\sum_{j=0}^{3} f\left(|t_i[j] - s^p_i[j+1]|\right) \qquad (6.5)$$
where $f(\cdot)$ is the smooth L1 loss function: $f(x) = 0.5x^2$ if $|x| < 1$, and $f(x) = |x| - 0.5$
if $|x| \geq 1$. The location parameters $t_i$ are in the form $[x_{i1}/w,\ y_{i1}/h,\ x_{i2}/w,\ y_{i2}/h] - 0.5$,
where $x_{i1}, x_{i2}$ are the minimum and maximum x-axis locations of proposal $r_i$, and $y_{i1}, y_{i2}$
are the minimum and maximum y-axis locations.
Aided by the KBP gate, we weight each proposal's regression loss $d_i$ based on the predicted
confidence $s^p_i[0]$ and knowledge $k^q_i$. The visual consistency loss $L^k_{vc}$ is calculated as
$$L^k_{vc} = \sum_{i=1}^{N} \phi(k^q_i)\,\sigma(s^p_i[0])\,d_i \qquad (6.6)$$
where $\phi(\cdot)$, $\sigma(\cdot)$ denote a softmax function and a sigmoid function respectively.
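The following is a minimal PyTorch sketch of Eqs. 6.5-6.6, with the softmax over knowledge scores and the sigmoid over confidences written out explicitly; the function and variable names are ours.

```python
import torch

def visual_consistency_loss(k_q, s_p, t):
    # k_q: (N,) knowledge scores; s_p: (N, 5) attention-model outputs;
    # t: (N, 4) proposal location parameters.
    diff = (t - s_p[:, 1:]).abs()
    d = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean(dim=1)  # Eq. 6.5
    w = torch.softmax(k_q, dim=0) * torch.sigmoid(s_p[:, 0])              # KBP gate x confidence
    return (w * d).sum()                                                  # Eq. 6.6
```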
6.2.4 Language Consistency
The goal of language consistency is to optimize the attention model via learning to
reconstruct the input query $q$ under a language consistency constraint.
In implementation, after the attention model predicts each proposal's confidence
of being relevant to query $q$ ($s^p_i[0]$ in Eq. 6.4), we adopt a structure similar to [99] to
weight each proposal's visual feature $v_i$ and project them into a reconstruction subspace.
Different from [99], we introduce the KBP gate into the language consistency branch to further
down-weight unrelated visual features' contribution. Thus, the knowledge conditioned
reconstruction feature is calculated as
$$v^k_{att} = W_a\left(\sum_{i=1}^{N} \phi(k^q_i)\,\sigma(s^p_i[0])\,v_i\right) + b_a \qquad (6.7)$$
where $W_a \in \mathbb{R}^{d_r \times d_v}$, $b_a \in \mathbb{R}^{d_r}$ are projection parameters to be optimized. Other
notations are the same as in Eq. 6.6.
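A matching sketch of Eq. 6.7 (KBP- and attention-weighted pooling followed by a linear projection) is given below; all names are ours.

```python
import torch

def knowledge_attended_feature(k_q, s_p, v, W_a, b_a):
    # k_q: (N,) knowledge scores; s_p: (N, 5) attention outputs;
    # v: (N, d_v) proposal features; W_a: (d_r, d_v); b_a: (d_r,)
    w = torch.softmax(k_q, dim=0) * torch.sigmoid(s_p[:, 0])
    pooled = (w.unsqueeze(1) * v).sum(dim=0)   # weighted sum over proposals
    return W_a @ pooled + b_a                  # project to the reconstruction space (Eq. 6.7)
```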
The reconstruction visual feature $v^k_{att}$ is then treated as the initial state of a decoding
LSTM, which predicts a sequence of probabilities $\{p^t_{\hat{q}}\}$ indicating the selection of words
at each time step $t$ of the reconstructed query $\hat{q}$. With the ground truth of the input query $q$
(the selection of words $w_t$ at each time step $t$), the language reconstruction loss $L^k_{lc}$ is the
average cross entropy over the sequence $\{p^t_{\hat{q}}\}$:
$$L^k_{lc} = -\frac{1}{T}\sum_{t=1}^{T}\log\left(p^t_{\hat{q}}[w_t]\right) \qquad (6.8)$$
where $T$ is the length of the input query $q$.
6.2.5 Training & Inference
In the training stage, the parameters to be optimized include the parameters of the encoding and
decoding LSTMs and the projection parameters in Eq. 6.3, 6.4, 6.7. We regularize the
weights of the projection parameters via the sum of the $\ell_2$ norms of these parameters ($L_{reg}$).
Same as [99], we select 100 proposals produced by the proposal generation systems ($N = 100$).
The rectified linear unit (ReLU) is selected as the non-linear activation function $\varphi$. KAC
Net is trained end-to-end using the Adam [54] algorithm.
In the test stage, we feed the query $q$ into the trained KAC Net, and select the most related
proposal based on the confidences $\{s^p_i[0]\}$ generated by the attention model (Eq. 6.4) and the
external knowledge $k^q_i$. The final prediction is given as (notations are the same as in Eq. 6.6):
$$r_{j^*},\ \text{s.t.}\ j^* = \arg\max_i \left\{\sigma(s^p_i[0])\,\phi(k^q_i)\right\} \qquad (6.9)$$
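For completeness, a minimal sketch of this selection rule (Eq. 6.9) follows; names are ours.

```python
import torch

def kac_inference(s_p, k_q):
    # Eq. 6.9: pick the proposal maximizing sigmoid(confidence) * softmax(knowledge).
    score = torch.sigmoid(s_p[:, 0]) * torch.softmax(k_q, dim=0)
    return int(torch.argmax(score))
```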
6.3 Experiment
We evaluate KAC Net on the Flickr30K Entities [93] and Referit Game [52] datasets in the weakly
supervised grounding scenario.
6.3.1 Datasets
Flickr30K Entities [93]: There are 29783, 1000, 1000 images in this dataset for training,
validation and testing respectively. Each image is associated with 5 captions, with 3.52
query phrases in each caption on average (360K query phrases in total). The vocabulary
size for all these queries is 17150. We ignore the bounding box annotations of these two
datasets in the weakly supervised scenario.
Referit Game [52]: There are 19,894 images of natural scenes in this dataset, with
96,654 distinct objects in these images. Each object is referred to by 1-3 query phrases
(130,525 in total). There are 8800 unique words among all the phrases, with a maximum
length of 19 words.
6.3.2 Experiment Setup
Proposal generation. We adopt Selective Search [109] for Flickr30K Entities [93] and
EdgeBoxes [129] for Referit Game [52] to generate proposals as grounding candidates for
fair comparison with [99] on these two datasets.
Visual feature representation. Same as [99], we choose a VGG Network [105]
finetuned by Fast-RCNN [33] on PASCAL VOC 2007 [19] to extract visual features for
Flickr30K Entities, which are denoted as "VGG$_{det}$". Besides, we follow [99] and apply a
VGG Network pre-trained on ImageNet [17] to extract visual features for both the Flickr30K
Entities and Referit Game datasets, which are denoted as "VGG$_{cls}$". Both "VGG$_{cls}$" and
"VGG$_{det}$" features are 4096D vectors ($d_v = 4096$).
Knowledge representation. To parse different queries, we use the Stanford NLP
parser [76] to extract the noun words in each query. We then extract probability
distributions of "VGG$_{det}$" features on the MSCOCO [66] image classification task for all proposals
(#classes=90). The similarity between noun words in queries and class names is
calculated as the cosine similarity via a word2vec program [79]. We also extract probability
distributions on the PASCAL VOC 2007 classification task [19] (#classes=20). Results with
different knowledge sources are provided in Sec. 6.3.3 and 6.3.4.
KBP gate. For the KBP gate, we adopt a soft version and a hard version. Soft KBP
applies the sigmoid function to transform the external knowledge $k^q_i$ into a probability to
directly weight each proposal, while hard KBP applies thresholding to force the probability
to be either 0 or 1 for each proposal (i.e., $k^q_{ih} = \mathbb{1}(k^q_{is} \geq t)$, where $\mathbb{1}$ is an indicator function and
subscripts "h", "s" denote hard KBP and soft KBP respectively).
In experiments, we set the threshold $t$ to 0.3 for Flickr30K Entities and 0.1 for
Referit Game. For hard KBP, if a query's knowledge scores are 0 for all proposals (i.e.,
$k^q_{ih} = 0, \forall i$), we set them all to 1 for language reconstruction in Eq. 6.7; otherwise, the
reconstruction feature $v^k_{att}$ provides no information to reconstruct the input query.
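A minimal sketch of the hard KBP gate with the all-zero fallback described above is given below; the names are ours and the comparison direction at the threshold is our assumption.

```python
import torch

def hard_kbp(k_soft, t=0.3):
    # Threshold the soft knowledge scores (t = 0.3 for Flickr30K Entities,
    # 0.1 for Referit Game); fall back to all ones if every score is zeroed.
    k_hard = (k_soft >= t).float()
    if k_hard.sum() == 0:
        k_hard = torch.ones_like(k_hard)
    return k_hard
```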
Model initialization. Following the same settings as in [99], input queries are encoded
through an LSTM model, and the query embedding vector $q$ is the last hidden state from the
LSTM ($d_q = 512$). All fc layers are initialized by the Xavier method [34] and all convolutional
layers are initialized by the MSRA method [41]. We introduce batch normalization layers after
projecting the visual and language features in Eq. 6.3.
During training, we set the batch size to 40. The dimension of the multimodal features
$v^q_i$ is set to $m = 128$ (Eq. 6.3). The hyperparameter $\beta$ for weight regularization is 0.005 and
$\alpha$ for the visual reconstruction loss is 10.0 in Eq. 6.1. Analysis of hyperparameters is provided
in the supplemental file.
Metric. Same as [99], we adopt accuracy as the evaluation metric, which is defined
as the ratio of phrases for which the predicted box overlaps with the mentioned object by
more than 50% Intersection over Union (IoU).
Compared approach. We choose GroundeR [99] as the compared approach, which
achieves state-of-the-art performance on both the Flickr30K Entities and Referit Game datasets.
6.3.3 Performance on Flickr30K Entities
Comparison in accuracy. We first evaluate the pure visual consistency branch's performance
on the weakly supervised grounding task. In Table 6.1, with a hard KBP gate, visual
consistency achieves a grounding accuracy of 28.53%, which is very close to the GroundeR
model. Then we introduce a soft KBP gate into the visual consistency branch, which brings
Approach Accuracy (%)
Compared approaches
GroundeR (LC) (VGG$_{cls}$) [99] 24.66
GroundeR (LC) (VGG$_{det}$) [99] 28.93
Our approaches
VC + Hard KBP (VGG$_{det}$) 28.58
VC + Soft KBP (VGG$_{det}$) 30.60
LC + Hard KBP (VGG$_{det}$) 32.17
LC + Soft KBP (VGG$_{det}$) 34.31
KAC Net + Hard KBP (VGG$_{det}$) 37.41
KAC Net + Soft KBP (VGG$_{det}$) 38.71
Table 6.1: Different models' performance on Flickr30K Entities. We explicitly evaluate the
performance of the visual consistency (VC) and language consistency (LC) branches with Hard
and Soft KBP gates. We leverage knowledge from the MSCOCO [66] classification task.
Knowledge PASCAL VOC [19] MSCOCO [66]
Hard KBP 35.24 37.41
Soft KBP 36.14 38.71
Table 6.2: Comparison of KAC Net using different KBP gates and external knowledge
on Flickr30K Entities. Accuracy is in %.
a 2.03% increase in accuracy. This indicates that visual consistency, even alone, is capable
of providing good performance in the weakly supervised scenario. According to [99], the
GroundeR model is actually a basic case of the language consistency branch without a KBP
gate. We first introduce a hard KBP gate into the language consistency branch, which brings a
3.42% increase in grounding performance. We then replace the hard KBP gate with a
soft KBP gate, which brings an additional 1.14% increase in performance. This further
validates the effectiveness of external knowledge in the weakly supervised grounding problem.
Finally, we combine visual and language consistency, which is the full KAC Net. By
applying a hard KBP gate, KAC Net achieves 37.41% in accuracy. We then replace the
hard KBP gate with a soft KBP gate. KAC Net reaches 38.71% in accuracy, which
is a 9.78% increase over the performance of GroundeR [99]. From Table 6.1, we also find the
soft KBP gate achieves consistently better performance than the hard KBP gate.
Detailed comparison. Table 6.3 provides detailed weakly supervised grounding
results based on the phrase type information for each query in Flickr30K Entities. We
Phrase Type people clothing body parts animals
GroundeR (VGG$_{det}$) [99] 44.32 9.02 0.96 46.91
LC + Soft KBP 55.23 4.21 2.49 67.18
VC + Soft KBP 51.56 5.33 2.87 58.11
KAC Net (Hard KBP) 55.14 7.29 2.68 73.94
KAC Net (Soft KBP) 58.42 7.63 2.97 77.80
Phrase Type vehicles instruments scene other
GroundeR (VGG$_{det}$) [99] 46.00 19.14 28.23 16.98
LC + Soft KBP 54.50 11.73 37.37 13.25
VC + Soft KBP 51.50 20.01 26.86 12.63
KAC Net (Hard KBP) 66.75 20.37 43.14 17.05
KAC Net (Soft KBP) 69.00 20.37 43.53 17.05
Table 6.3: Phrase grounding performance for different phrase types defined in Flickr30K Entities. Accuracy is in percentage.
can observe that KAC Net provides superior results in most categories. However, different
models have different strengths. Language consistency with a soft KBP gate (LC+Soft
KBP) is good at localizing "people", "animals" and "vehicles", with 10.91%, 20.27%
and 8.5% increases in accuracy compared to the GroundeR model. Compared to language
consistency, visual consistency (VC+Soft KBP) is better at localizing "clothing", "body
parts" and "instruments", with 1.12%, 0.38% and 8.28% increases. However, for other
categories, the visual consistency branch achieves inferior performance. By incorporating
both visual and language consistency, KAC Net observes consistent improvement in all
categories except for the category "clothing". With a soft KBP gate, KAC Net achieves
14.10%, 23.00% and 30.89% increases in localizing "people", "vehicles" and "animals".
However, KAC Net also has a 1.39% drop in accuracy when localizing "clothing". This may
be because "clothing" is usually on "people". In this case, there is a high chance for a
grounding system to classify "clothing" as "people" by mistake. Besides, "clothing"
does not have corresponding categories in the external knowledge.
Knowledge representation. To validate the effectiveness of external knowledge,
we also evaluate KAC Net's performance using distributions predicted by a VGG Network
pre-trained on PASCAL VOC 2007 [19] image classification. In Table 6.2, we observe that
applying external knowledge achieves consistent improvement in grounding performance
Approach Accuracy (%)
Compared approaches
LRCN [18] 8.59
Caffe-7K [38] 10.38
GroundeR [99] (LC) (VGG$_{cls}$) 10.70
Our approaches
LC + Hard KBP (VGG$_{cls}$) 13.02
LC + Soft KBP (VGG$_{cls}$) 13.97
KAC Net + Hard KBP (VGG$_{cls}$) 14.68
KAC Net + Soft KBP (VGG$_{cls}$) 15.83
Table 6.4: Different models' performance on Referit Game. We leverage knowledge from
the MSCOCO [66] classification task.
Knowledge PASCAL VOC [19] MSCOCO [66]
Hard KBP 12.04 14.68
Soft KBP 13.38 15.83
Table 6.5: Comparison of KAC Net using different KBP gates and external knowledge
on Referit Game. Accuracy is in %.
compared to the GroundeR [99] model. Moreover, knowledge from the MSCOCO [66] image
classification task achieves a slight increase in accuracy compared to that from the PASCAL VOC
2007 [19] image classification task. This may be because MSCOCO contains more categories
of objects, and so may be more accurate in describing the proposal's relevance to the
query.
6.3.4 Performance on Referit Game
Comparison in accuracy. Following [99], we adopt EdgeBoxes [129] as the proposal
generator. As shown in Table 6.4, by introducing the KBP gate into the language consistency
branch, we achieve 2.32% (Hard KBP) and 3.27% (Soft KBP) increases compared to the
state-of-the-art GroundeR [99] model. We observe that using the soft KBP gate achieves a slight
increase in performance over the hard KBP gate. When KAC Net incorporates both visual and
language consistency, it achieves another 1.66% and 1.86% increase compared to the language
consistency branch with hard and soft KBP respectively. The full model achieves 15.83% grounding
accuracy, a 5.13% increase over the GroundeR model.
Type A Type B All Type A Type B All
# queries 1762 15757 17519 8275 51796 60071
Soft KBP 37.26 19.77 21.53 12.88 7.74 8.45
GroundeR 26.54 29.19 28.93 7.29 11.24 10.70
G + KBP 41.03 32.17 33.06 14.16 12.56 12.78
LC + KBP 42.13 33.44 34.31 15.28 13.76 13.97
KAC Net 45.66 37.93 38.71 18.36 15.43 15.83
Table 6.6: Different methods on Flickr30K Entities [93] (left three columns) and Referit Game [52]
(right three columns) for two types of queries. Accuracy is in %.
Knowledge representation. Similar to Flickr30K Entities, we also evaluate KAC
Net's performance using knowledge from the PASCAL VOC 2007 [19] image classification
task. In Table 6.5, we observe that applying external knowledge learned from MSCOCO [66] image
classification achieves better performance than that from PASCAL VOC 2007 [19]. However,
both knowledge representations help achieve an increase in grounding accuracy over
the state-of-the-art model.
6.3.5 Discussion
To further explore KAC Net's performance on different types of queries, we define queries
with / without words in MSCOCO categories as "Type A" and "Type B" respectively.
In Table 6.6, we evaluate two more compared methods on both the Flickr30K Entities [93]
and Referit Game [52] datasets: soft KBP only, and pre-trained
GroundeR [99] with soft KBP (denoted as "G + KBP").
From Table 6.6, pre-trained GroundeR shows a performance boost by adopting KBP.
However, after end-to-end training (LC+KBP) and applying the visual consistency part,
KAC Net still outperforms state-of-the-art methods by a significant margin. These results
also show the generalizability of KAC Net.
6.3.6 Qualitative Results
We visualize some of KAC Net's grounding results on the Flickr30K Entities and Referit
Game datasets for qualitative evaluation in Fig. 6.4. For Flickr30K Entities, we first
show the image description where the query phrases come from, then show the grounding
results and ground truth objects in red and green bounding boxes respectively. For Referit
Game, each query is independent with no common image descriptions, so we visualize two
example images with two queries each in the third row of Fig. 6.4.
We find KAC Net is strong in recognizing people ("a girl" in the first row) and vehicles
("cars" in the third row), and is able to ground complex queries ("water bottle second
in the right" in the third row), which is also validated in Table 6.3. However, since KAC
Net takes only a single query phrase as input, it is unable to make use of context, such as
in the example of "a man" in the third row of Fig. 6.4.
[Figure 6.4 panels: example image descriptions with their query phrases, e.g., "A girl rides a blue bike down a city sidewalk" (queries: "A girl", "a blue bike", "a city sidewalk"), "A man is taking a photo of another man and his two dogs on some grassy hills" (queries: "A man" (incorrect), "two dogs", "some grassy hills"), "A lady in a red car is crossing the bridge" (queries: "A lady", "a red car", "the bridge"); Referit Game queries: "red backpack", "water bottle second in the right", "cars", "people standing on the right".]
Figure 6.4: Some phrase grounding results on Flickr30K Entities [93] (first three rows) and
Referit Game [52] (fourth row). We visualize the ground truth bounding box and the grounding
result in green and red respectively. When a query is not clear without further context
information, KAC Net may ground reasonable but incorrect objects (e.g., image in row three,
column two).
Chapter 7
Knowledge Level Multimodal Reasoning for Visual
Question Answering
7.1 Introduction
Visual Question Answering (VQA) is the task of answering questions, posed in natural
language, about the semantic content in an image (or video). Given an image and an
image-related question, VQA answers the question in one word or a natural language
sentence. VQA is of great importance to many applications, including image retrieval,
early education, and navigation for blind people, as it provides user-specific information
through the understanding of both the natural language questions and image content.
VQA is a highly challenging problem as it requires the machine to understand natural
language queries, extract semantic contents from images, and relate them in a unified
framework. In spite of these challenges, an exciting set of methods have been developed
by the research community in recent years.
Current state-of-the-art VQA models [96][75][26] generally contain a vision part, a
question understanding part and an answer generation part. The vision part extracts
visual features through a deep convolutional neural network (CNN) [60] or using a
traditional visual feature extractor. The question understanding part learns a dense
question embedding feature vector to encode question semantics, either with a Bag-of-Words
model [96] or a recurrent neural network (RNN) [44] model. The answer generation
part produces an answer conditioned on the visual features and the question embeddings.
The answer can either be a single word generated by a multi-class classifier [96]
[Figure 7.1 panels: for "What is the color of the umbrella?" a traditional VQA model that analyzes the whole image answers "green", while attention-based VQA first finds the umbrella and answers "red"; for "What is the color of the coat?" the traditional answer is "brown" and the attention-based answer is "yellow".]
Figure 7.1: Attention in visual question answering. For different questions, the corresponding
attention region varies from the white dashed box "coat" in the left image to the
one "umbrella" in the right image.
or a full sentence generated by an additional RNN decoder [75][26]. The visual features
and dense question embeddings are integrated through a linear [96] / non-linear [75][26]
transform which jointly projects the features from image space and semantic space into
answer space. This integration is normally not sufficient to fully exploit the relationship
of the vision part and the question understanding part because it loses the opportunity
to exploit the intent of queries to focus on different regions in an image.
When trying to answer a question about an image, humans tend to search the informative
regions according to the question's intent before giving the answer. For example, in
Fig. 7.1, considering the query "What is the color of the coat?", it is common for humans
to focus attention on the region of the coat before judging its color to answer the question.
Based on this observation, we propose a novel attention-based configurable convolutional
neural network (ABC-CNN) to locate such informative regions and give more accurate
answers for VQA. We call the mechanism of finding informative regions based on the input
question's intent "question-guided attention", because these regions are determined by
both images and image-related questions. As shown in Fig. 7.2, ABC-CNN contains a
vision part, a question understanding part, an answer generation part, and an attention
extraction part. We employ a CNN to extract visual features in the vision part. Instead
Figure 7.2: The framework of ABC-CNN. The green box denotes the image feature
extraction part using a CNN; the blue box is the question understanding part using an LSTM;
the yellow box illustrates the attention extraction part with configurable convolution; the
red box is the answer generation part using multi-class classification based on attention
weighted image feature maps. The orange letters are the corresponding variables explained
in Eq. (7.1) - (7.6).
of extracting a single global visual feature, we extract a spatial feature map to retain
crucial spatial information, by either applying a CNN in a sliding-window fashion or with
a fully convolutional neural network. A long short-term memory (LSTM) [44] model is
used to obtain question embeddings in the question understanding part. In this paper, we
only consider the VQA task with single-word answers, which can be generated by utilizing
a multi-class classifier in the answer generation part. Our method can be extended to
generate full sentences by using an RNN decoder.
We present the question-guided attention information as a question-guided attention
map (QAM), which is the core of the ABC-CNN framework. We model the QAM as latent
information, and do not require explicit labeling of such maps for all kinds of possible
queries. The QAM is generated by searching for visual features that correspond to the
input query's semantics in the spatial image feature map. We achieve the search via a
configurable convolutional neural network, which convolves the visual feature map with a
configurable convolutional kernel (CCK). This kernel is generated by transforming the
question embeddings from the semantic space into the visual space, and it contains the
visual information determined by the intent of the question. For example, in Fig. 7.1, the
question "What is the color of the umbrella?" should generate a CCK that corresponds to
the "umbrella" visual features. Convolving the CCK with the image feature map adaptively
represents each region's importance for answering the given question as a QAM. The
generated QAMs can be utilized to spatially weight the visual feature maps to filter out
noise and unrelated information. With the visual features conditioned on the input query,
ABC-CNN can return more accurate answers from the multi-class classifier in the answer
generation part. The whole framework can be trained in an end-to-end way without
requiring any manual labeling of attention regions in images.
In the experiments, we evaluate the ABC-CNN framework on three benchmark VQA
datasets: Toronto COCO-QA [96], DAQUAR [74] and VQA [2]. Our method significantly
outperforms state-of-the-art methods. Visualization of attention maps demonstrates that
the ABC-CNN architecture is capable of generating attention maps that well reflect the
regions queried by questions.
In summary, we propose a unified ABC-CNN framework to effectively integrate the
visual and semantic information for VQA via question-guided attention. Not only does
the question-guided attention significantly improve the performance of VQA systems, but
it also helps us gain a better understanding of the question answering process.
7.2 Attention Based Configurable CNN
The framework of ABC-CNN is illustrated in Fig. 7.2. We restrict ourselves to QA pairs with single-word
answers in this paper; this allows us to treat the task as a multi-class classification
problem, which simplifies the evaluation metrics so that we can concentrate on developing
question-guided attention models.
ABC-CNN is composed of four components: (1) the image feature extraction part,
(2) the question understanding part, (3) the attention extraction part and (4) the answer
generation part. In the image feature extraction part (green box), a deep CNN is used
to extract an image feature map I for each image as the image representation. We
utilize the VGG-19 deep convolutional neural network [105] pretrained on the 1000-class
ImageNet classification challenge 2012 dataset [17], and a fully convolutional segmentation
neural network [14] pretrained on the PASCAL 2007 segmentation dataset. The question
understanding part (blue box) adopts an LSTM to learn a dense question embedding
vector s to encode the semantic information of an image-related question. The attention
extraction part (yellow box) configures a set of configurable convolutional kernels (CCKs)
according to different dense question embeddings. These kernels, emphasizing the visual
features of objects asked about in the question, are convolved with the image feature maps to
generate question-guided attention maps (QAMs). The answer generation part, shown in
the red box, answers a question using a multi-class classifier based on the image feature
map I, the attention weighted image feature map, and the dense question embedding
vector s. The rest of this section describes each component of the ABC-CNN framework
in detail.
7.2.1 Attention Extraction
A QAM, m, capturing the image regions queried by the question, is generated for each
image-question pair using a configurable convolutional neural network. The configurable
convolution operation can be thought of as searching spatial image feature maps for specific
visual features that correspond to the question's intent. The specific visual features
are encoded as a CCK k in this network, which is configured by projecting the dense
question embedding s from semantic space to visual space.
k = \sigma(W_{sk} s + b_k), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}    (7.1)
where σ(·) is the sigmoid function.
The dense question embedding s encodes the semantic object information asked in
the question. The projection transforms the semantic information into the corresponding
visual information as a CCK, which has the same number of channels as the image feature
map I.
The QAM is generated by convolving the CCK k with the image feature map I, and
applying softmax normalization:

m_{ij} = P(\mathrm{ATT}_{ij} \mid I, s) = \frac{e^{z_{ij}}}{\sum_i \sum_j e^{z_{ij}}}, \qquad z = k * I    (7.2)
where m_ij is the element of the QAM at position (i, j), and the symbol * represents
the convolution operation. The QAM characterizes the attention distribution across the
image feature map. The convolution is padded so that the QAM m has the same size as
the image feature map I. The QAM corresponds to the regions asked by the question.
For example, the question "What is the color of the umbrella?" can generate an attention
map focusing on umbrella image regions because the CCK is configured to find umbrella
visual features.
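To make this computation concrete, the following is a minimal Python (PyTorch) sketch of Eqs. (7.1)-(7.2); PyTorch itself, the 1×1 kernel shape and all tensor sizes are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def question_guided_attention(I, s, W_sk, b_k):
    """Compute a question-guided attention map (QAM) as in Eqs. (7.1)-(7.2).

    I    : image feature map, shape (D, N, N)
    s    : dense question embedding, shape (d_q,)
    W_sk : kernel projection matrix, shape (D, d_q)
    b_k  : kernel projection bias, shape (D,)
    """
    # Eq. (7.1): configure the convolutional kernel from the question embedding.
    k = torch.sigmoid(W_sk @ s + b_k)                  # CCK, shape (D,)

    # Eq. (7.2): convolve the CCK with the feature map (a 1x1 spatial kernel
    # here for simplicity), then softmax-normalize over all spatial positions.
    z = F.conv2d(I.unsqueeze(0), k.view(1, -1, 1, 1))  # (1, 1, N, N)
    m = torch.softmax(z.flatten(), dim=0).view(I.shape[1], I.shape[2])
    return m                                           # QAM, shape (N, N), sums to 1

# Toy usage with random tensors.
D, N, d_q = 256, 3, 256
m = question_guided_attention(torch.randn(D, N, N), torch.randn(d_q),
                              torch.randn(D, d_q), torch.randn(D))
assert torch.isclose(m.sum(), torch.tensor(1.0))
```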
With the attention map m, we can improve the question answering accuracy on
various classes of questions for the following reasons:
• For counting questions, such as "how many cars in the image?", the attention map
filters out the unrelated regions, which makes it easier for the model to infer the
number of objects in an image.
• For color (and more general attribute) questions, such as "what is the color of the
coat?", the color of the specific object can be answered more effectively by focusing
on the object of interest.
• For object questions, such as "what is sitting on top of the desk?", the attention
map can filter out less relevant regions such as background and infer better locations
to look for objects according to their spatial relationship.
• For location questions, such as "where is the car in the image?", the attention map
is crucial for generating the correct answers because it evidently describes where
the object is in the image.
7.2.2 Question Understanding
Question understanding is crucial for visual question answering. The semantic meaning
of questions not only provides the most important clue for answer generation, but also
determines the CCKs to generate attention maps.
Recently, the LSTM model has shown good performance in language understanding [44].
We employ an LSTM model to generate a dense question embedding to characterize
the semantic meaning of questions. A question q is first tokenized into the word sequence
{v_t}. We convert all upper-case characters to lower case and remove all
punctuation. The words that appear in the training set but are absent in the test set are
replaced with a special symbol #OOV#. Besides, the special symbols #B# and #E# are
added to the head and end of the sequence. According to a question dictionary, each
word is represented as a dense word embedding vector, which is learned in an end-to-end
way. An LSTM is applied to the word embedding sequence to generate a state h_t from
each vector v_t, using memory gate c_t and forget gate f_t, as illustrated in Eq. (7.3).
i_t = \sigma(W_{vi} v_t + W_{hi} h_{t-1} + b_i), \qquad f_t = \sigma(W_{vf} v_t + W_{hf} h_{t-1} + b_f)
o_t = \sigma(W_{vo} v_t + W_{ho} h_{t-1} + b_o), \qquad g_t = \phi(W_{vg} v_t + W_{hg} h_{t-1} + b_g)
c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \phi(c_t)    (7.3)
where φ is the hyperbolic tangent function and ⊙ represents the element-wise product
between two vectors. The semantic information s of the input question q is obtained by
averaging the LSTM states {h_t} over all time steps.
7.2.3 Image Feature Extraction
The visual information in each image is represented as an N×N×D image feature map.
The feature map is generated by dividing an image into an N×N grid, and extracting
a D-dimensional feature vector f in each cell of the grid. The VGG-19 [105] deep
convolutional neural network extracts a D-dimensional feature vector for each window.
The D-dimensional feature vector for each cell is the average of all the 10 D-dimensional
feature vectors. The final N×N×D image feature map is the concatenation of the N×N
D-dimensional feature vectors.
It is also possible to exploit a fully convolutional neural network architecture [70] to
extract image feature maps more efficiently. We employ the segmentation model [14]
pretrained on the PASCAL 2007 segmentation dataset and find it leads to slightly better
performance.
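The grid-based construction can be sketched as follows in Python; the cell resizing scheme, the stand-in per-cell extractor and all sizes are illustrative assumptions, not our exact windowing procedure.

```python
import torch
import torch.nn.functional as F

def image_feature_map(image, cnn, N=3, cell_size=224):
    """Build an N x N x D feature map by running a per-cell feature extractor
    over an N x N grid of the image (Sec. 7.2.3).  `cnn` is any callable mapping
    a (1, 3, cell_size, cell_size) tensor to a (1, D) feature vector."""
    _, H, W = image.shape
    rows = []
    for i in range(N):
        cols = []
        for j in range(N):
            cell = image[:, i * H // N:(i + 1) * H // N, j * W // N:(j + 1) * W // N]
            cell = F.interpolate(cell.unsqueeze(0), size=(cell_size, cell_size),
                                 mode="bilinear", align_corners=False)
            cols.append(cnn(cell).squeeze(0))   # (D,) feature for this grid cell
        rows.append(torch.stack(cols))          # (N, D)
    return torch.stack(rows).permute(2, 0, 1)   # (D, N, N) image feature map

# Toy usage: a stand-in "CNN" that average-pools the cell and projects to D = 4096.
proj = torch.nn.Linear(3, 4096)
toy_cnn = lambda x: proj(x.mean(dim=(2, 3)))    # (1, 3, h, w) -> (1, 4096)
fmap = image_feature_map(torch.rand(3, 480, 640), toy_cnn)   # shape (4096, 3, 3)
```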
7.2.4 Answer Generation
The answer generation part is a multi-class classifier based on the original image feature
map, the dense question embedding, and the attention weighted feature map.
We employ the attention map to spatially weight the image feature map I. The
weighted image feature map focuses on the objects asked about in the question. The spatial
weighting is achieved by the element-wise product between each channel of the image
feature map and the attention map.
I'_i = I_i \odot m    (7.4)
where ⊙ represents the element-wise product. I'_i and I_i represent the i-th channel of
the attention weighted feature map I' and the original image feature map I, respectively. The
attention weighted feature map lowers the weights of the regions that are irrelevant to the
meaning of the question. To avoid overfitting, we apply a 1×1 convolution on the attention
weighted feature map to reduce the number of channels, resulting in a reduced feature
map I_r. The question's semantic information s, the image feature map I and the reduced
feature map I_r are then fused by a nonlinear projection.
h = g(W_{ih} I + W_{rh} I_r + W_{sh} s + b_h)    (7.5)
where h denotes the final projected feature, and g(·) is the element-wise scaled hyperbolic
tangent function: g(x) = 1.7159 · tanh(2x/3) [61]. This function drives the gradients into
the most non-linear range of values and enables a higher training speed.
A multi-class classifier with softmax activation, which is trained on the final projected
features, predicts the index of an answer word specified in an answer dictionary. The
answer generated by ABC-CNN is the word with the maximum probability.
a^* = \arg\max_{a \in V_a} p_a \quad \text{s.t.} \quad p_a = g(W_{ha} h + b_a)    (7.6)
Notice that we do not share the word dictionary for questions and answers, i.e., one word
can have different indices in the question dictionary and answer dictionary.
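A compact Python sketch of Eqs. (7.4)-(7.6) follows; flattening the spatial maps before the linear fusion and all dimensions are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

def scaled_tanh(x):
    # g(x) = 1.7159 * tanh(2x / 3), the element-wise scaled tanh of Eq. (7.5).
    return 1.7159 * torch.tanh(2.0 * x / 3.0)

class AnswerGenerator(nn.Module):
    """Attention-weight the feature map, reduce it with a 1x1 convolution, fuse
    it with the original map and the question embedding, and classify over the
    answer dictionary (Eqs. 7.4-7.6)."""
    def __init__(self, D=256, N=3, d_q=256, d_r=64, d_h=512, n_answers=1000):
        super().__init__()
        self.reduce = nn.Conv2d(D, d_r, kernel_size=1)   # 1x1 conv on the attended map
        self.fuse_I = nn.Linear(D * N * N, d_h)
        self.fuse_Ir = nn.Linear(d_r * N * N, d_h)
        self.fuse_s = nn.Linear(d_q, d_h)
        self.classifier = nn.Linear(d_h, n_answers)

    def forward(self, I, m, s):
        # Eq. (7.4): weight every channel of I by the attention map m.
        I_att = I * m.unsqueeze(0)                            # (D, N, N)
        I_r = self.reduce(I_att.unsqueeze(0)).flatten(1)      # (1, d_r * N * N)
        # Eq. (7.5): nonlinear fusion of I, I_r and the question embedding s.
        h = scaled_tanh(self.fuse_I(I.flatten().unsqueeze(0))
                        + self.fuse_Ir(I_r) + self.fuse_s(s.unsqueeze(0)))
        # Eq. (7.6): softmax classifier over the answer dictionary.
        p = torch.softmax(self.classifier(h), dim=-1)
        return p.argmax(dim=-1)                               # predicted answer index

model = AnswerGenerator()
attention = torch.softmax(torch.randn(9), dim=0).view(3, 3)
answer = model(torch.randn(256, 3, 3), attention, torch.randn(256))
```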
7.2.5 Training and Testing
Our whole framework is trained in an end-to-end way with stochastic gradient descent
and the Adadelta [124] algorithm. Each batch of the stochastic gradient descent randomly
samples 64 image-question pairs independently, and back propagation is applied to learn
all the weights of the ABC-CNN architecture. We adjust the random initialization of the
weights of all the layers to ensure that each dimension of the activations in all layers has
zero mean and unit standard deviation. The initial learning rate is set to 0.1. In our
experiments, the weights in the image feature extraction part are fixed to allow faster training,
although it is possible to train all the weights in ABC-CNN in an end-to-end way.
During the testing stage, an image feature map is extracted for each image. Given
a question, we produce its dense question embedding, and utilize the question embedding
to configure the CCK to generate the attention map. The multi-class classifier
generates the answer using the original feature map, the question embedding, and the
attention weighted feature map.
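For concreteness, a minimal Python sketch of this optimization setup is shown below; the stand-in model, data and loss are placeholders, not our actual ABC-CNN training pipeline.

```python
import torch

# Illustrates the setup described above: batch size 64, Adadelta, learning rate 0.1.
abc_cnn = torch.nn.Linear(10, 5)                    # placeholder for the full network
optimizer = torch.optim.Adadelta(abc_cnn.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

for step in range(3):                               # three toy batches
    features = torch.randn(64, 10)                  # stand-in for (image, question) inputs
    answers = torch.randint(0, 5, (64,))            # ground-truth answer indices
    loss = criterion(abc_cnn(features), answers)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```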
7.3 Experiments
We evaluate our model on the Toronto COCO-QA [96], DAQUAR [74] and VQA [2]
datasets. We evaluate our method on the QA pairs with single-word answers, which account
for (100%, 85%, 90%) of the Toronto COCO-QA, VQA and DAQUAR datasets, respectively. This is also
consistent with the evaluation in [96]. Besides, our framework can be easily extended
to generate answers in full sentences by using an RNN decoder in the answer generation
part.
7.3.1 Implementation Details
In experiments, we first choose the resolution of both the image feature map and the
attention map to be 3×3, which is called the "ATT" model. Each image cell generates a 4096-
dimensional image feature vector using a pre-trained VGG network [5], and we extend
each feature vector with the HSV histogram of the cell, resulting in a 4276-dimensional
image feature vector for each cell. The image feature vectors from all the image cells
constitute an image feature map with dimension 4276×3×3. To avoid overfitting, we
reduce the dimension of the feature map to 256×3×3 with a 1×1 convolution. The
dimension of the dense question embedding is 256. We also try a second model called
"ATT-SEG", which employs a fully convolutional neural network [14] pretrained on the
PASCAL 2007 segmentation dataset to generate 16×16×1024 feature maps, and concatenates
them with HSV histograms in each cell as image feature maps. In the end, we combine
the VGG features, HSV features and segmentation features together, obtaining a model
called "ATT-VGG-SEG". It takes around 24 hours to train the ATT network on the Toronto
COCO-QA dataset with four Nvidia K40 GPUs. The system can generate an answer in
9.89 ms per question on a single K40 GPU.
7.3.2 Datasets
We evaluate our models on three datasets: DAQUAR [74], Toronto COCO-QA [96] and
VQA [2].
The DAQUAR dataset has two versions: the full dataset (DQ-full) and the reduced one
(DQ-reduced). DQ-reduced has question-answer pairs over 37 object classes, which is a
subset of the DQ-full dataset that has 894 object classes. Both versions use the indoor scene
images from the NYU-Depth V2 dataset [104]. The DQ-full dataset contains 795 training
images with 6794 QA pairs, and 654 test images with 5674 QA pairs. The DQ-reduced
dataset contains 781 training images with 3825 QA pairs and 25 test images with 286 QA
pairs. We only train and test on DAQUAR QA pairs with single-word answers,
which is consistent with the evaluation in [96]. Such QA pairs constitute (90.6%, 89.5%)
and (98.7%, 97.6%) of the training and test sets for the DQ-full and DQ-reduced datasets,
respectively.
The Toronto COCO-QA dataset uses images from the Microsoft COCO dataset [66] (MS-COCO).
Its QA pairs only contain single-word answers.
The VQA dataset [2] is a recently collected dataset which is also built with images from the
MS-COCO dataset. We evaluate the proposed model on the VQA Real Image (Open-Ended)
task in the VQA dataset. It contains 82783 training images, 40504 validation images, and
81434 testing images. Each image in the MS-COCO dataset is annotated with 3 questions,
[Figure 7.3 panels: example QA pairs, e.g., COCO-QA: "What rests on the street next to a bicycle?" (A: puppy); DAQUAR: "What is the object close to the wall?" (A: whiteboard), "What is the object in front of the sofa?" (A: table); VQA: "How many bikes are there?" (A: 2), "What number is the bus?" (A: 48), "How many pickles are on the plate?" (A: 1), "What is the shape of the plate?" (A: round).]
Figure 7.3: Example images and image-related QA pairs in the Toronto COCO-QA dataset
[96], DAQUAR dataset [74] and VQA dataset [2]. For the VQA dataset, every question has
10 candidate answers. We show the answer with the most votes for each question.
and each question has 10 candidate answers. The total number of QA pairs for training,
testing, and validation is 248349, 121512 and 244302, respectively. We only evaluate our
method on the single-word answer QA pairs in the VQA dataset, which constitute 86.88% of
the total QA pairs in this dataset. Some examples from the three datasets are shown in
Fig. 7.3.
7.3.3 Evaluation Metrics
As in [96][74], we evaluate the performance of the VQA models with "answer accuracy"
(ACC.) and the "Wu-Palmer similarity measure Set" (WUPS) score [116][74]. The answer
accuracy computes the percentage of the generated answers that exactly match the ground
truth answers. The WUPS score is derived from the Wu-Palmer (WUP) similarity [116],
whose value is in the range [0, 1]. WUP similarity measures the similarity of two words
based on the depth of their lowest common ancestor in the taxonomy tree [116]. The
WUPS score with a threshold is the average of the down-weighted WUPS scores for all
the generated answers and ground truth. If the WUPS score of two words, s_wups, is below the
threshold, their down-weighted WUPS score is 0.1 s_wups. Otherwise, the down-weighted
WUPS is s_wups. We use WUPS scores with thresholds 0.0 and 0.9 in our experiments,
the same as [74].
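The down-weighting rule can be written as a few lines of Python; the per-answer WUP similarities are assumed to be precomputed from the taxonomy tree.

```python
def wups_score(wup_similarities, threshold):
    """Down-weighted WUPS score as described above: a WUP similarity below the
    threshold is scaled by 0.1, otherwise kept as-is, and the results are
    averaged over all generated answers."""
    scores = [s if s >= threshold else 0.1 * s for s in wup_similarities]
    return sum(scores) / len(scores)

# Toy usage: three answers with WUP similarities 1.0 (exact match), 0.92 and 0.4.
print(wups_score([1.0, 0.92, 0.4], threshold=0.9))   # (1.0 + 0.92 + 0.04) / 3
print(wups_score([1.0, 0.92, 0.4], threshold=0.0))   # no down-weighting applied
```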
Model ACC. WUPS 0.9 WUPS 0.0
LSTM 0.3676 0.4758 0.8234
IMG 0.4302 0.5864 0.8585
IB 0.5592 0.6678 0.8899
VL 0.5331 0.6391 0.8825
2VB 0.5509 0.6534 0.8864
ENSEMBLE 0.5784 0.6790 0.8952
NO-ATT 0.5414 0.6483 0.8855
ATT 0.5803 0.6814 0.8966
ATT-SEG 0.5804 0.6833 0.8979
ATT-SEG-VGG 0.5810 0.6844 0.8985
Table 7.1: Results on Toronto COCO-QA dataset [96]
7.3.4 Baseline Methods
We compare the proposed method with different benchmark methods used in [74][96][2][75].
All the baseline models are listed below:
• VIS+LSTM (VL): The framework proposed in [96], with a CNN extracting
image features followed by a dimension reduction layer. The image features are
then inserted into the head position of the question word embedding sequence as
input for the question LSTM.
• 2-VIS+BLSTM (2VB): The image features are inserted at the head and the tail
of the question word embedding sequence. Besides, the question LSTM in [96] is set
to go in both forward and backward directions.
• IMG+BOW (IB): Ren et al. [96] use Bag-of-Words features to generate the dense
question embedding.
• IMG: Only the image features are used for answering the questions. It is called a
"deaf" model.
• LSTM: The answers are generated only using the dense question embedding from
the LSTM. It is called a "blind" model.
Model Object Number Color Location
LSTM 0.3587 0.4534 0.3626 0.3842
IMG 0.4073 0.2926 0.4268 0.4419
IB 0.5866 0.4410 0.5196 0.4939
VL 0.5653 0.4610 0.4587 0.4552
2VB 0.5817 0.4479 0.4953 0.4734
ENSEMBLE 0.6108 0.4766 0.5148 0.5028
NO-ATT 0.5882 0.4319 0.4168 0.4762
ATT 0.6217 0.4799 0.4727 0.5194
ATT-SEG 0.6238 0.4617 0.4694 0.5278
ATT-SEG-VGG 0.6246 0.4570 0.4681 0.5367
Table 7.2: Toronto COCO-QA [96] accuracy per category
• ENSEMBLE: Ren et al. [96] evaluate a fusion model using an ensemble
of all the above methods.
• Q+I: In [2], question answering is achieved by training a multi-class classifier
using both the dense question embeddings and the image features.
• Q+I+C: Compared to the Q+I model, the Q+I+C model [2] adopts the dense
embeddings of labeled image captions as an additional input.
• ASK: In [75], the answers are generated by linearly combining CNN features and
question embeddings in an LSTM decoder.
7.3.5 Results and Analysis
Tables 7.1, 7.3, and 7.4 summarize the performance of different models on the Toronto COCO-QA,
DQ-reduced and DQ-full datasets, respectively. Table 7.2 breaks down the performance
of different methods in each category on the Toronto COCO-QA dataset.
In Table 7.1, ABC-CNN using only the ATT model surpasses all the baseline models. It
also outperforms the ENSEMBLE model by 1.1% in terms of answer accuracy, although
we only employ a single model. ABC-CNN outperforms the baseline methods in the
"object", "number" and "location" categories, because question-guided attention exploits the
semantics of questions and the contextual information in images to answer the questions.
Model ACC. WUPS 0.9 WUPS 0.0
LSTM 0.3273 0.4350 0.8162
IB 0.3417 0.4499 0.8148
VL 0.3441 0.4605 0.8223
2VB 0.3578 0.4683 0.8215
ENSEMBLE 0.3694 0.4815 0.8268
NO-ATT 0.3931 0.4445 0.8230
ATT 0.4276 0.4762 0.8304
HUMAN 0.6027 0.6104 0.7896
Table 7.3: Results on DAQUAR-reduced dataset [74]
Its accuracy is slightly lower than the IB and ENSEMBLE models in the "color" category.
We also find the performance of the fully convolutional model ATT-SEG is slightly better
than ATT, while extracting feature maps with fully convolutional neural networks is much
faster. Combining the features of ATT and ATT-SEG together (ATT-VGG-SEG)
results in the best performance. In particular, adding the fully convolutional model helps
correctly answer the location questions. We also remove the attention in ABC-CNN
(NO-ATT) as an ablative experiment, and it results in 1.34%, 0.85%, and 0.35%
losses in accuracy, WUPS 0.9 and WUPS 0.0 scores, respectively.
In Table 7.3, we compare the ABC-CNN model with the baseline models on the DQ-reduced
dataset. Its performance is higher than all the single models on all the metrics. It is only
0.53% lower than the ENSEMBLE model on the WUPS 0.9 measure.
On the DQ-full and VQA datasets, ABC-CNN outperforms state-of-the-art methods on
both datasets, as shown in Tables 7.4 and 7.5. On the DQ-full dataset, the ABC-CNN model is the
same as the models on the Toronto COCO-QA and DQ-reduced datasets. On the VQA dataset,
to make a fair evaluation, we employ the same answer dictionary that contains the 1000
most frequent answers (ATT 1000) as [2]. We also evaluate the ABC-CNN model using
the answer dictionary that contains all the answers (ATT Full).
Some of the generated question-guided attention maps and their corresponding images
and questions are shown in Fig. 7.4. We can observe that the question-guided attention
maps successfully capture different questions' intents with different attention regions.
With these attention maps, ABC-CNN is capable of generating more accurate answers
Model ACC. WUPS 0.9 WUPS 0.0
ASK 0.1943 0.2528 0.6200
ATT 0.2537 0.3135 0.6589
HUMAN 0.5020 0.5082 0.6727
Table 7.4: Results on DAQUAR-full dataset [74]
Model Q+I Q+I+C ATT 1000 ATT Full
ACC. 0.2678 0.2939 0.4838 0.4651
Table 7.5: Performance of different models on the VQA dataset [2]
by focusing its attention on important regions and filtering out irrelevant information.
Since the original feature map is also provided when predicting answers, ABC-CNN can
answer the question without using the attention map if the object queried is the only
object in the image (e.g., the QA pair in row 2, column 2 of Fig. 7.4). In this case, the
attention map may not focus on the object queried.
[Figure 7.4 panels: example questions with question-guided attention maps, e.g., "What topped with baby carrots and beans next to a peeler?" (A: plate), "What does the plate have which is cut up?" (A: carrots), "What sits next to the laptop?" (A: bear), "Stuffed bear posed as if using what?" (A: computer), "What parked on what looks like bricks?" (A: bus), "What is the color of the bus?" (A: yellow), "What walks through the grass of a field?" (A: elephant), "What is at the left of the table in the image?" (A: chair), "What color are the bed sheets in the image?" (A: red), "What is the color of the cake?" (A: white), "What are standing next to the flower pot?" (A: dogs).]
Figure 7.4: Selected images with image-related questions and question-guided attention
maps generated by ABC-CNN on the Toronto COCO-QA dataset [96]. We find the proposed
ABC-CNN model is capable of focusing its attention on different regions for different
questions. The attention maps help filter out irrelevant information and model the spatial
relationships of the objects in images.
Chapter 8
Reasoning between Visual and Audio Modalities
8.1 Introduction
When we observe visual events in the world, such as a stick hitting a metal object, a car
racing or a helicopter flying, we can immediately imagine and associate some sounds with
these events. The objective of this work is to synthesize realistic sound that corresponds
to the visual content in a silent video (i.e., visually indicated sound generation). This
ability is useful for many real applications, such as sound/video editing automation,
enhanced immersion in virtual reality, and assistance for people with visual
impairments.
Visually indicated sound generation is a challenging problem that involves parsing
visual information and converting it into sound in the audio modality. A number of methods
have been suggested in recent work such as [88, 13, 128], which adopt a Convolutional
Neural Network (CNN) to encode visual features and a Long Short Term Memory Network
(LSTM) [44] or a Generative Adversarial Network (GAN) to generate sound. One
common characteristic of these approaches is that they consider visually indicated sounds
to belong to variations of a single class even though the sounds for different activities can
be quite different. For example, in Fig. 8.1, the sound of hitting an "iron cabinet" lasts longer
than that of hitting "water"; besides, the spectrograms of these two sounds show different
distributions: the sound of hitting an "iron cabinet" contains more high-frequency components than
the sound of hitting "water".
Figure 8.1: Difference between the sounds of hitting an "iron cabinet" and "water" in sound wave
and spectrogram. It is hard for a generic model to handle all kinds of sound generation.
To address the significant variations, we introduce the concept of sound classes, where
each type of action generates sounds belonging to a specific class, and then use class
predictions to generate more finely tuned sounds. We average sound clips of the same class
to create a base sample. Given visual features, our audio generation model predicts the
sound class and transforms the predicted sound class's base sample into visually consistent
sound. Furthermore, we leverage a state-of-the-art sound classification network to compute
a perceptual loss [50] during training; this loss aims to align the predicted sound's
semantic characteristics with the ground truth in the feature space of the pre-trained sound
classification network.
In implementation, we propose a novel Perceptually Optimized Classification based
Audio generation Network (POCAN). POCAN adopts a CNN+LSTM structure to encode
visual features, predict the sound class and regress sound spectrograms. The generated
sound wave is calculated as the Inverse Short Time Fourier Transform (ISTFT) [114] of
the predicted spectrogram, which is the sum of the predicted regression parameters and a
base sample corresponding to the predicted sound class. During training, a pre-trained
SoundNet [3] is deployed to compute the perceptual loss as the feature difference between
the predicted sound and its ground truth. An analogous perceptual loss has been used for
image generation [50] but is novel to audio generation, to the best of our knowledge.
We evaluate POCAN on the popular Greatest Hits Dataset [88]. Besides, we collected
visual frames and evaluate POCAN on a subset of AudioSet [32]. Quantitative evaluations
are conducted on sound classification and retrieval tasks, which have also been used in [88,
128] and have been shown to have a high correlation with subjective evaluations. In both of
these tests, POCAN outperforms state-of-the-art methods by a large margin. Besides,
we provide some generated sound samples in the supplementary material for qualitative
evaluation.
Our contributions are three-fold: (1) We propose to generate visually indicated sound
considering different sound classes; (2) We leverage a pre-trained SoundNet and apply a
perceptual loss to refine POCAN during training; (3) We collect a visually indicated
sound generation dataset and plan to release it upon publication.
8.2 Method
POCAN is composed of two parts: classification based audio generation and perceptual
optimization, which are shown in Fig. 8.2. In this work, we focus on generating visually
indicated sound with a fixed time length. We first present the framework of POCAN in
Sec. 8.2.1, followed by the details of classification based audio generation and perceptual
optimization in Sec. 8.2.2 and 8.2.3 respectively. Finally, we illustrate how to train
POCAN and generate the sound wave in Sec. 8.2.4.
8.2.1 Framework
The goal of POCAN is to generate a sound wave y given the corresponding video clip's
frame sequence {x}. We do not generate the raw sound wave directly from visual information;
instead, we predict the spectrogram s of sound clip y, which can be converted back to
a wave form via an Inverse Short Time Fourier Transform (ISTFT) [114]. To achieve
this, the fine-grained audio generation part predicts a sound class probability distribution p as
well as spectrogram regression parameters d based on visual features. According to the
predicted distribution p, the most probable sound class's base sample is selected, and
the synthesized sound spectrogram s' is the addition of the base sample and the predicted
regression parameters. To capture semantic characteristics of the real sound y, the synthesized
spectrogram s' is converted to wave form ŷ via ISTFT. A perceptual loss L_p is then
Figure 8.2: Framework of the Perceptually Optimized Classification based Audio generation
Network (POCAN). Video frames are first processed by a CNN and then fed into an
LSTM. To generate sound clips, POCAN predicts sound classes, regresses the LSTM's
hidden states into spectrograms and then transforms the predicted spectrograms into
sound waveforms. To increase the quality of the generated sound, a pre-trained SoundNet [3]
is applied to calculate a perceptual loss during the training stage.
calculated by comparing the difference between the SoundNet features of y and ŷ. The
objective for POCAN is:

\arg\min_{\theta} \sum_{x} L_{cls}(p, c) + \lambda L_{reg}(s', s) + \mu L_{p}(\hat{y}, y)    (8.1)
where θ denotes the POCAN parameters to be optimized and λ, μ are hyperparameters.
c is the class label of sound clip y. L_cls is the loss for sound class prediction. L_reg is a
regression loss for synthesizing the sound spectrogram s'. L_p is a perceptual loss for capturing
semantic characteristics from the real sound clip y.
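A minimal Python sketch of the combined objective in Eq. (8.1) follows; the mean-reduction smooth-L1 form of the regression and perceptual terms, and the λ = 50, μ = 100 values quoted in Sec. 8.4.1, are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def pocan_loss(p, c, s_pred, s_gt, feat_pred, feat_gt, lam=50.0, mu=100.0):
    """Overall POCAN objective of Eq. (8.1): classification loss on the class
    distribution p, regression loss on the synthesized spectrogram, and
    perceptual loss on (SoundNet-style) features of the waveforms."""
    l_cls = -torch.log(p[c])                        # Eq. (8.4), given class probabilities p
    l_reg = F.smooth_l1_loss(s_pred, s_gt)          # Eq. (8.6)
    l_p = F.smooth_l1_loss(feat_pred, feat_gt)      # Eq. (8.7)
    return l_cls + lam * l_reg + mu * l_p

# Toy usage (17 classes, 1025 x 22 spectrograms, 1024-d sound features).
p = torch.softmax(torch.randn(17), dim=0)
loss = pocan_loss(p, 3, torch.randn(1025, 22), torch.randn(1025, 22),
                  torch.randn(1024), torch.randn(1024))
```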
8.2.2 Classification based Audio generation Network
For visual input, each video frame x_i is encoded as a visual feature vector x_i ∈ R^{d_v}
by a pre-trained CNN [42]. d_v represents the dimension of the visual feature vectors. To
encode the temporal information in video frames, we feed these video features {x_t} into
an LSTM [44], where the encoding procedure can be written as
i_t = \sigma(W_{vi} x_t + W_{hi} h_{t-1} + b_i), \qquad f_t = \sigma(W_{vf} x_t + W_{hf} h_{t-1} + b_f)
o_t = \sigma(W_{vo} x_t + W_{ho} h_{t-1} + b_o), \qquad g_t = \phi(W_{vg} x_t + W_{hg} h_{t-1} + b_g)
c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \phi(c_t)    (8.2)
where φ is the hyperbolic tangent function and ⊙ represents the element-wise product
between two vectors. The encoded features are represented as the LSTM hidden states {h_i} ∈ R^{d_h}.
d_h is the dimension of the hidden space.
To predict the sound class and regression parameters, we project the hidden states {h_i}
into an audio space:

f_i = W h_i + b    (8.3)
where W ∈ R^{(d_c+d_s)×d_h} and b ∈ R^{(d_c+d_s)} are training parameters to be optimized. d_c denotes
the number of sound classes and d_s is the feature dimension of the sound spectrogram. The first
d_c elements of f_i represent the logits of the sound class probability prediction for frame x_i, while
the remaining elements record the regression parameters of the spectrogram. The classification loss L_cls
is:
L_{cls}(p, c) = -\log p[c], \qquad p = \frac{1}{t_s} \sum_{i=1}^{t_s} \sigma(f_i[0 : d_c - 1])    (8.4)
where σ is a softmax function. t_s is the length of the sequence {f_i}, which is the same as the
time length of the spectrogram to be generated.
To synthesize the spectrogram, we average the different sound spectrograms according to their
classes in the training set. Each class j then has an averaged spectrogram A_j ∈ R^{d_s×t_s}
as a base sample. The synthesized sound spectrogram is calculated as:
s' = d + A_{j^*}, \quad \text{s.t.} \quad j^* = \arg\max_i \{ p[i] \}    (8.5)
where the regression parameters d ∈ R^{d_s×t_s} are generated by stacking the vectors
{f_i[d_c : d_c + d_s - 1]}. After obtaining the fine-grained sound spectrogram, the regression loss L_reg
is calculated by a smooth L1 regression loss function g(·):

L_{reg} = \| g(s' - s) \|_1, \qquad g(x) = \begin{cases} 0.5 x^2, & |x| < 1 \\ |x| - 0.5, & |x| \geq 1 \end{cases}    (8.6)
8.2.3 Perceptual optimization
To further improve the realism of the generated sound, we leverage a state-of-the-art sound
classification network to capture different sounds' semantic characteristics. Specifically,
we adopt a pre-trained SoundNet [3], freeze its parameters and apply it to encode sound
features for both the real sound y and the synthesized sound ŷ during training. The synthesized
sound wave ŷ is generated by an ISTFT operation from the predicted spectrogram s'. The
perceptual loss is then calculated by comparing the features of the real and synthesized
sound:

L_{p} = \| g(\Phi(\hat{y}) - \Phi(y)) \|_1    (8.7)

where Φ(·) denotes the feed-forward feature extraction process of SoundNet [3]. Other
notations are the same as in Eq. (8.6).
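The perceptual term can be sketched in a few lines of Python; `soundnet` stands for any frozen feature extractor (the pre-trained SoundNet conv7 features in our description), and the toy extractor in the usage example is a placeholder.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(y_pred, y_real, soundnet):
    """Eq. (8.7) sketch: compare frozen-extractor features of the synthesized
    and real waveforms with a smooth-L1 penalty; gradients flow only through
    the synthesized waveform."""
    with torch.no_grad():
        feat_real = soundnet(y_real)        # ground-truth features, no gradient needed
    feat_pred = soundnet(y_pred)            # gradients reach the generator through y_pred
    return F.smooth_l1_loss(feat_pred, feat_real)

# Toy usage: a frozen linear layer as a stand-in feature extractor.
toy_soundnet = torch.nn.Linear(8000, 1024)
for param in toy_soundnet.parameters():
    param.requires_grad_(False)
loss = perceptual_loss(torch.randn(8000, requires_grad=True), torch.randn(8000), toy_soundnet)
```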
8.2.4 Training & sound wave generation
Due to the different sampling rates of the visual and audio modalities (the audio sample rate
is much higher than the video frame rate), the length of the visual feature sequence is shorter
than the time length of the audio spectrogram. We uniformly replicate visual features at
each time step so that the visual sequence's length is the same as that of the audio spectrogram. The
parameters to be optimized include the parameters of the LSTM and the projection parameters in
Eq. (8.3). POCAN is trained end-to-end using the Adam [54] algorithm.
Following [88] and [128], we generate and evaluate the sound wave in two ways. The first is
directly converting the synthesized sound spectrogram into a sound wave via ISTFT; we denote
this generated sound as raw sound, which is useful for evaluating what information is
Class        Dog Bark      Cattle      Sheep Bleat     Chicken    Church Bell
# Train      1124          739         1117            1085       1095
# Test       59            60          60              86         61

Class        Helicopter    Fire Alarm  Hammer          Gunshot    Fireworks
# Train      1111          854         435             1001       1109
# Test       58            60          60              171        60

Class        Thunderstorm  Car Racing  Rail Transport  Splash     Water Spray
# Train      1109          1098        1012            862        1119
# Test       62            72          167             58         60

Table 8.1: Number of training and testing samples in the Visually Indicated sound Generation
(VIG) dataset
captured by the audio features. Second, for the task of generating plausible visually
indicated sound to human ears, we use the generated raw sound as a query and retrieve
the nearest neighbor in the training set according to the similarities between spectrogram
features, and set the retrieved real sound wave as our generation result. The similarity
between two spectrogram features s_1 and s_2 is calculated as:
\mathrm{sim}(s_1, s_2) = \left\langle \frac{1}{t_s} \sum_{i=1}^{t_s} s_1[i], \; \frac{1}{t_s} \sum_{i=1}^{t_s} s_2[i] \right\rangle    (8.8)
where ⟨·,·⟩ represents the cosine similarity. We denote this second type of sound as exemplar
sound, which is used for retrieval and human evaluation.
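A NumPy sketch of this exemplar-sound retrieval follows; the unit-normalization constant and the array shapes are illustrative assumptions.

```python
import numpy as np

def exemplar_retrieval(query_spec, train_specs):
    """Eq. (8.8) sketch: average each spectrogram over time, compare the query
    against every training spectrogram with the cosine measure, and rank the
    training clips by similarity.
    query_spec: (d_s, t_s); train_specs: list of (d_s, t_s) arrays."""
    def time_avg_unit(s):
        v = s.mean(axis=1)                               # average over the t_s time steps
        return v / (np.linalg.norm(v) + 1e-8)            # unit-normalize for the cosine measure
    q = time_avg_unit(query_spec)
    sims = np.array([time_avg_unit(s) @ q for s in train_specs])
    return np.argsort(-sims)                             # most similar training clip first

ranking = exemplar_retrieval(np.abs(np.random.randn(1025, 22)),
                             [np.abs(np.random.randn(1025, 22)) for _ in range(5)])
```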
8.3 Datasets
We evaluate POCAN on two datasets: the Greatest Hits Dataset [88] and a manually annotated
subset of AudioSet [32].
Greatest Hits Dataset (GHD) [88] contains 977 videos from indoor (64%) and
outdoor (36%) scenes. There are 733 videos (21436 clips) and 244 videos (7008 clips)
in this dataset for training and testing respectively. There are 17 sound classes in GHD
(d_c = 17). Each labeled video lasts 0.5 s with a single class label.
Figure 8.3: Some examples of video clips and corresponding sound clips from the VIG dataset
Visually Indicated sound Generation dataset (VIG) is a subset of AudioSet [32].
AudioSet [32] is a large-scale dataset of manually annotated audio events.
It contains 2,084,320 human-labeled 10-second sound clips in 632 audio event classes from
YouTube videos. Among them, we manually select 16,024 high-quality sound clips in
15 classes which have a strong correlation with visual frames (d_c = 15). Each sound clip
belongs to one video. The number of training and test clips for each class is shown in
Table 8.1. Some examples of video clips and corresponding sound clips are visualized in
Fig. 8.3.
8.4 Experiments
POCAN is evaluated on GHD and VIG for the visually indicated sound generation task.
Since no public implementation is available for the state-of-the-art method on GHD¹, we
re-implemented the method in [88], but there may be differences from the authors'
implementation.
8.4.1 Experiment setup
We introduce the details of feature representation, model initialization, the evaluation metric
and the compared method in this subsection.
Audio feature representation. To calculate the regression loss in Eq. (8.6), we compute
the spectrogram of each sound wave via a Short Time Fourier Transform (STFT) [114] operation.
We use a Hann window [40] of size 2048 to encode and decode sound spectrograms.
The feature dimension is 1025 (d_s = 1025). We use sample rates of 22.05 kHz and 8 kHz
for the sound wave on GHD and VIG respectively. The time length of the spectrograms is 22
(t_s = 22) on GHD and 157 on VIG (t_s = 157). We denote the spectrogram feature as "spec".
For a fair comparison with [88], we also extract a cochleagram [82] for each sound clip, which
is denoted as "coch"; its feature dimension is 42 (d_s = 42). For the perceptual loss in Eq. (8.7),
we apply a pre-trained SoundNet [3] and extract its conv7 features for each sound clip's
real and predicted sound wave during training. The feature dimension is 1024.
¹ Project page of the Greatest Hits Dataset (GHD): http://vis.csail.mit.edu
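The spectrogram extraction described above can be sketched in Python as follows; librosa is an assumed dependency (any STFT implementation works equally well), and the default hop length and magnitude representation are illustrative choices.

```python
import numpy as np
import librosa   # assumed dependency; any STFT implementation would do

def sound_spectrogram(y, n_fft=2048):
    """Magnitude spectrogram with a Hann window of size 2048, giving
    n_fft/2 + 1 = 1025 frequency bins (d_s = 1025)."""
    S = librosa.stft(y, n_fft=n_fft, window="hann")   # complex STFT
    return np.abs(S)                                  # (1025, t_s) magnitude spectrogram

# Toy usage: a 0.5 s clip at the GHD sample rate of 22.05 kHz.
sr = 22050
t = np.linspace(0.0, 0.5, int(0.5 * sr), endpoint=False)
clip = 0.1 * np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
print(sound_spectrogram(clip).shape)   # roughly (1025, 22), matching t_s = 22 on GHD
```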
Visual feature representation. We apply a 200-layer ResNet [42] pre-trained
on ImageNet [17] to extract a visual feature for each frame in a video clip. The feature
dimension is 2048 (d_v = 2048). We denote these features as "res". For a fair comparison
with [88], we also apply an AlexNet [59] pre-trained on ImageNet [17] to extract visual
features, with dimension d_v = 4096. We denote these features as "alex".
Model initialization. During training, we set the batch size to 40. The hyperparameters
λ and μ are set to 50 and 100 respectively. The dimension of the hidden states of the
LSTM is 128 (d_h = 128). We apply the Xavier method [34] to initialize the training parameters
of POCAN.
Evaluation metric. We choose Recall at top K (R@K) as the evaluation metric. When
generating each exemplar sound, we check the top K retrieved samples from the training
set in the ranking list of each test sample. If there exists a retrieved training sample
with the same sound class as the test sample, we consider it a successful retrieval.
R@K measures the success ratio over all test samples in the top K retrieval results. Besides,
to compare with [88], we also train a 5-layer neural network sound classifier on real
sound from the training set, and feed it the generated raw sound to check the classification results.
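To make the retrieval metric precise, the following Python sketch computes the R@K outcome for a single test sample; the class names are toy examples.

```python
def recall_at_k(retrieved_classes, query_class, k):
    """R@K outcome for one test sample: success (1) if any of the top-K
    retrieved training samples shares the query's sound class.  Averaging the
    outcomes over all test samples gives R@K."""
    return int(query_class in retrieved_classes[:k])

retrieved = ["dog bark", "gunshot", "dog bark", "fireworks", "hammer"]
print(recall_at_k(retrieved, "gunshot", k=1))   # 0: the top-1 retrieval is wrong
print(recall_at_k(retrieved, "gunshot", k=5))   # 1: the correct class appears in the top-5
```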
Compared approach. We choose [88] as the compared method, which achieves
state-of-the-art performance on GHD. We also re-implemented and evaluated [88] on VIG.
We are unable to compare with [128] as its code and dataset are not publicly available at
this time.
8.4.2 Performance on GHD
We evaluate different models' exemplar sound on the R@K task and raw sound on the
classification task on GHD respectively.
Comparison in R@K. Following the settings of [88], we adopt AlexNet features for
the visual modality and cochleagrams for the audio modality. The performance of [88] (alex +
coch) is shown in Table 8.2. By replacing the audio features with spectrograms, we observe a
slight improvement in R@K (0.44% in R@1). Fixing the spectrogram features, we then replace
the AlexNet features with ResNet features (Owens et al. [88] (res + spec)), and observe a
further improvement of 2.19%, 2.72% and 7.55% in R@1, R@5 and R@10 respectively.
Model K = 1 K = 5 K = 10
Owens et al. [88] (alex + coch) 0.1471 0.3982 0.4896
Owens et al. [88] (alex + spec) 0.1515 0.4077 0.5083
Owens et al. [88] (res + spec) 0.1734 0.4349 0.5838
CAN (res + spec) 0.3013 0.5364 0.6476
POCAN (res + spec) 0.3302 0.5719 0.6553
Table 8.2: Different models' R@K performance on GHD (K = 1, 5, 10)
Based on this feature combination (res + spec), we evaluate the Classification based
Audio generation Network (CAN) in POCAN. In Table 8.2, we observe a significant
improvement of 12.79%, 10.15% and 6.38% in R@1, R@5 and R@10 respectively. This indicates that
CAN generates better sound, so that the most similar retrieved samples contain more
characteristics of the test sample's sound class. We further evaluate the full POCAN
model on GHD and observe that POCAN achieves new state-of-the-art performance on the
similar sound retrieval task, with 15.68%, 13.70% and 7.15% increases over the method
of [88] (res + spec) in R@1, R@5 and R@10 respectively.
Sound classification. Similar to [88], we evaluate whether the generated sound contains
semantic information about the sound class. We train a 5-layer neural network to classify
different sounds. Each layer is a 1D convolution layer followed by a rectified linear unit
(ReLU) non-linear activation function. This network is trained on real sound clips from
the training set of GHD. In the test stage, we generate the raw sounds from different models
and feed them into the pre-trained classifier. We calculate the average classification accuracy
on the test set of GHD as the metric. The classifier's performance as well as different models'
sound classification accuracies are provided in Table 8.3. It is worth noticing that our
Model Accuracy (%)
Owens et al. [88] (res + spec) 20.11
CAN (res + spec) 35.46
POCAN (res + spec) 36.32
Real sound clips 51.34
Table 8.3: Classification accuracy of different models' generated sound, measured by a pre-trained
5-layer neural network classifier.
Figure 8.4: Comparison of confusion matrices of sound classification results by a pre-trained
5-layer neural network classifier. Each row is the confusion made for a single
sound class. The left and right figures are the confusion matrices of sound generated by [88] and
POCAN respectively.
Model K = 1 K = 5 K = 10
Owens et al. [88] (res + spec) 0.0997 0.2888 0.4640
CAN (res + spec) 0.1180 0.3469 0.4709
POCAN (res + spec) 0.1223 0.3625 0.4802
Table 8.4: Different models' R@K performance on VIG (K = 1, 5, 10)
neural network classifier achieves 51.34% classification accuracy, while the pre-trained SVM
mentioned in [88] achieves 45.8%, which indicates that our classifier is better at classifying
sound.
We observe that the fine-grained generation part achieves better performance than [88]
on the sound classification task. This indicates that the raw sound generated by the fine-grained
generation part provides more sound class information than that of [88], which is easier for the
classifier to recognize. We further apply the perceptual loss and evaluate the sound generated
by POCAN. Our classifier reports the highest classification accuracy of 36.32%, which
is 16.11% higher than that of [88]. Besides, we draw the confusion matrices for the sound
classification results of [88] and POCAN in Fig. 8.4. From the confusion matrices, we find
that POCAN's sound achieves consistently better performance than [88] in all sound categories.
The sound classes with obvious improvement include tile, water and cloth.
Figure 8.5: Spectrograms of ground truth sound (GT) and retrieved exemplar sound by
POCAN on the GHD dataset. For each sample, we label some moments when actions happen
in the GT and exemplar sound.
Figure 8.6: Spectrograms of ground truth sound (GT) and retrieved exemplar sound by
POCAN on the VIG dataset. For each sample, we label some moments when actions happen
in the GT and exemplar sound.
8.4.3 Performance on VIG
Comparison in R@K. Based on the feature combination of ResNet as visual features
("res") and spectrograms as audio features ("spec"), we evaluate different models' R@K
on VIG. In Table 8.4, by adopting CAN, we observe an improvement of 1.83%, 5.81% and
0.69% in R@1, R@5 and R@10 respectively. After applying the perceptual loss, POCAN
achieves the state-of-the-art performance, with 2.26%, 7.37% and 1.62% increases in R@1,
R@5 and R@10 over [88] respectively. We notice that the room for improvement on VIG
is still large. This may be because the time length of sound clips in VIG is 10 seconds,
while it is only 0.5 seconds on GHD. In this case, the sequences of both audio and visual
features become 20 times longer, which brings extra difficulty for a system to generate
reasonable sound.
8.4.4 Qualitative evaluation
For qualitative evaluation, we visualize some spectrograms of exemplar sound generated
by POCAN as well as the corresponding ground truth in Figs. 8.5 and 8.6. For each sample, we
label its sound class and some time points when an action happens in that clip. We observe that
the pattern of the exemplar sound is similar to the ground truth sound, and the occurrences of
sound events are temporally close. However, POCAN also retrieves less similar samples
which contain more actions or noise (e.g., the first result in row 2 of Fig. 8.6). The project
page is at http://www.github.com/kanchen-usc/VIG, with a demo video available at
https://www.youtube.com/watch?v=ewtwCp7zeoY.
Chapter 9
Conclusion and Future Work
In previous chapters, I described my efforts towards solving multimodal reasoning problems,
from coarse-level image retrieval to visual question answering. In this
chapter, I conclude the thesis and discuss a few possible future directions.
9.1 Conclusion
In Chapter 3, we provide a possible solution for image retrieval, which focuses on coarse-level
multimodal reasoning. We proposed an Attention guided Multi-modal Correlation
(AMC) learning method in [6]. AMC models adaptively attend to useful modalities
and filter out unrelated information within each modality according to the input query's
intent. The AMC framework can be further boosted by incorporating more image-related
modalities and external knowledge; this is discussed in the future work below.
In Chapters 4-6, we focus on the fine-grained level of multimodal reasoning. In Chapter 4,
we proposed a novel Multimodal Spatial Regression with semantic Context (MSRC)
system [9], which focuses on the phrase grounding problem. Given a query and query-related
context information, the MSRC system applies a Spatial Regression Network (SRN) to predict
the mentioned object's location based on the proposal bounding box with the highest
probability. Besides, the MSRC system applies a Context Refinement Network (CRN) to refine
the results by encoding context information and adopting a novel joint prediction loss
during the training stage. The MSRC system not only relieves the performance limitation brought
by the proposal generation system, but also takes advantage of context information to filter
out confusing candidates. The MSRC system achieves a significant improvement in performance
compared to the state-of-the-art on the Flickr30K Entities [93] and Refer-it Game [52] datasets.
In Chapter 5, we proposed a novel deep learning network (QRC Net) to address the
phrase grounding task [10]. Given a query and query-related context information, QRC
Net incorporates a Proposal Generation Network (PGN) to generate proposals online.
Then a Query-guided Regression Network (QRN) is applied to predict the mentioned
object's location based on the proposal bounding box with the highest probability. To guide
the QRN to select more discriminative proposals related to the query, QRC Net applies a Context
Policy Network (CPN) to refine the results by assigning rewards as policy gradients.
QRC Net not only relieves the performance limitation brought by proposal generation
systems, but also leverages context information to further boost performance. Experiments
show QRC Net provides a significant improvement in performance on the Flickr30K
Entities [93] and Referit Game [52] datasets.
In Chapter 6, we proposed a novel Knowledge Aided Consistency Network (KAC Net)
to address the weakly supervised grounding task [7]. KAC Net applies both visual and
language consistency to guide the training and leverages free complementary knowledge
to boost performance. Experiments show KAC Net provides a significant improvement
in performance compared to the state-of-the-art, with 9.78% and 5.13% increases in accuracy
on the Flickr30K Entities [93] and Referit Game [52] datasets respectively.
In Chapter 7, we focus on the knowledge level of multimodal reasoning by addressing
the Visual Question Answering (VQA) problem. We propose a unified attention based
configurable convolutional neural network (ABC-CNN) framework in [11]. It unifies
visual feature extraction and semantic question understanding via a question-guided attention
map. The attention map is generated by a configurable convolution network that
is adaptively determined by the meaning of questions. ABC-CNN significantly improves
both visual question answering performance and our understanding of the integration of
question semantics and image contents.
Finally, in Chapter 8, we consider the acoustic modality in multimodal reasoning. We
proposed a novel Perceptually Optimized Classification based Audio generation Network
(POCAN) which aims to produce visually indicated sound conditioned on video
frames [12]. Compared to previous methods, we consider sound class information and
adopt a perceptual loss during the training stage. To evaluate POCAN, we collect a visually
indicated sound generation dataset from AudioSet [32]. Experiments show that POCAN
provides significant improvements on two datasets.
9.2 Future Work
There are several remaining questions Id love to pursue in the future.
Including more modalities in multimodal reasoning
I only focused on reasoning between static images and natural language modalities.
However, there are much more modalities with dierent characteristics and diculties.
For example, the interaction between video and natural language [83, 30, 4, 27] remains a
challenging task. We can leverage current video detection techniques [101, 102, 103, 65, 31]
to apply in this problem. Besides, 3D reconstruction from video modality [62, 126, 87]
has important applications in self-driving problem, which is also an interesting problem
for future exploration.
Providing more ecieint algorithms in weakly supervised scenario
I assumed that training data is available for multimodal reasoning. However, this is often not
true in realistic settings. For the image grounding problem, each image is associated with
several queries, and there are thousands of images in one dataset. Manual labeling is
extremely expensive and suffers from human bias. How to deal with multimodal reasoning in
weakly supervised or unsupervised scenarios is therefore an urgent problem.
Combining new techniques to make real impact
The recent generative adversarial network (GAN) structure [36] provides a useful way
to train neural networks adversarially. We can apply this structure to train
multimodal systems to be more discriminative while requiring less training data. Besides,
applying multimodal reasoning to make real impact in areas such as self-driving cars, social
networks and recommendation systems is the ultimate goal. In the following years, I
would like to further push in this direction.
Bibliography
[1] A. Karpathy and F.-F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[3] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations
from unlabeled video. In NIPS, 2016.
[4] R. Bolles, B. Burns, J. Herson, G. Myers, J. van Hout, W. Wang, J. Wong, E. Yeh,
A. Habibian, D. Koelma, et al. The 2014 sesame multimedia event detection and
recounting system. In Proc. TRECVID Workshop, 2014.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[6] K. Chen, T. Bui, C. Fang, Z. Wang, and R. Nevatia. AMC: Attention guided
multi-modal correlation learning for image search. In CVPR, 2017.
[7] K. Chen, J. Gao, and R. Nevatia. Knowledge aided consistency for weakly super-
vised phrase grounding. In CVPR, 2018.
[8] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia. MSRC: Multimodal spatial regression
with semantic context for phrase grounding. In ICMR, 2017.
[9] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia. MSRC: Multimodal spatial regression
with semantic context for phrase grounding. IJMIR, 2018.
[10] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with con-
text policy for phrase grounding. In ICCV, 2017.
[11] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia. ABC-CNN:
An attention based convolutional neural network for visual question answering.
CVPRW, 2016.
[12] K. Chen, C. Zhang, C. Fang, Z. Wang, T. Bui, and R. Nevatia. Visually indi-
cated sound generation by perceptually optimized classication. In ECCV MULA
Workshop, 2018.
[13] L. Chen, S. Srivastava, Z. Duan, and C. Xu. Deep cross-modal audio-visual gener-
ation. In ACM MM Workshop, 2017.
[14] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic
image segmentation with deep convolutional nets and fully connected crfs. ICLR,
2015.
[15] G. Chowdhury. Introduction to modern information retrieval. Facet publishing,
2010.
[16] D. Datta, S. Varma, S. K. Singh, et al. Multimodal retrieval using mutual infor-
mation based textual query reformulation. ESWA, 2017.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. Imagenet: A large-scale
hierarchical image database. In CVPR, 2009.
[18] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual
recognition and description. In CVPR, 2015.
[19] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge. IJCV, 2010.
[20] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
[21] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multi-
modal compact bilinear pooling for visual question answering and visual grounding.
EMNLP, 2016.
[22] K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canon-
ical correlation analysis. JMLR, 2007.
[23] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng. Stylenet: Generating attractive visual
captions with styles. CVPR, 2017.
[24] C. Gan, Y. Li, H. Li, C. Sun, and B. Gong. VQS: Linking segmentations to ques-
tions and answers for supervised attention in vqa and question-focused semantic
segmentation. In ICCV, 2017.
[25] C. Gan, T. Yang, and B. Gong. Learning attributes equals multi-source domain
generalization. In CVPR, 2016.
[26] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a
machine? dataset and methods for multilingual image question answering. arXiv
preprint arXiv:1505.05612, 2015.
[27] J. Gao*, K. Chen*, and R. Nevatia. Ctap: Complementary temporal action pro-
posal generation. ECCV, 2018.
[28] J. Gao, R. Ge, K. Chen, and R. Nevatia. Motion-appearance co-memory networks
for video question answering. In CVPR, 2018.
[29] J. Gao, C. Sun, Z. Yang, and R. Nevatia. TALL: Temporal activity localization via
language query. ICCV, 2017.
[30] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. TURN TAP: Temporal unit
regression network for temporal action proposals. ICCV, 2017.
[31] R. Ge, J. Gao, K. Chen, and R. Nevatia. MAC: Mining activity concepts for
language-based temporal localization. WACV, 2019.
[32] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore,
M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for
audio events. In ICASSP, 2017.
[33] R. Girshick. Fast R-CNN. In ICCV, 2015.
[34] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[35] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for
modeling internet images, tags, and their semantics. IJCV, 2014.
[36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[37] A. Gordo, J. Almazan, J. Revaud, and D. Larlus. Deep image retrieval: Learning
global representations for image search. ECCV, 2016.
[38] S. Guadarrama, E. Rodner, K. Saenko, N. Zhang, R. Farrell, J. Donahue, and
T. Darrell. Open-vocabulary object retrieval. In Robotics: science and systems,
2014.
[39] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis:
An overview with application to learning methods. Neural computation, 2004.
[40] F. J. Harris. On the use of windows for harmonic analysis with the discrete fourier
transform. Proceedings of the IEEE, 1978.
[41] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[42] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition.
In CVPR, 2016.
[43] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network.
NIPS Workshop, 2014.
[44] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation,
1997.
[45] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language
object retrieval. In CVPR, 2016.
[46] X.-S. Hua, L. Yang, J. Wang, J. Wang, M. Ye, K. Wang, Y. Rui, and J. Li. Clickage:
towards bridging semantic and intent gaps via mining click logs of search engines.
In ACM Multimedia, 2013.
[47] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep
structured semantic models for web search using clickthrough data. In ACM CIKM,
2013.
[48] J. Jin, K. Fu, R. Cui, F. Sha, and C. Zhang. Aligning where to see and what to tell:
image caption with region-based attention and scene factorization. arXiv preprint
arXiv:1506.06272, 2015.
[49] T. Joachims. Optimizing search engines using clickthrough data. In ACM
SIGKDD, 2002.
[50] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer
and super-resolution. In ECCV, 2016.
[51] J. Johnson, A. Karpathy, and F.-F. Li. Densecap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
[52] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. Referit game: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[53] A. Karpathy, A. Joulin, and F.-F. Li. Deep fragment embeddings for bidirectional
image sentence mapping. In NIPS, 2014.
[54] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR,
2015.
[55] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings
with multimodal neural language models. TACL, 2015.
[56] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and
S. Fidler. Skip-thought vectors. In NIPS, 2015.
[57] B. Klein, G. Lev, G. Sadeh, and L. Wolf. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, 2015.
[58] B. Klein, L. Wolf, and Y. Afek. A dynamic convolutional layer for short range weather prediction. In CVPR, 2015.
[59] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[60] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541-551, 1989.
[61] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9-48. Springer, 2012.
[62] S. Liang, L. G. Shapiro, and I. Kemelmacher-Shlizerman. Head reconstruction from
internet photos. In ECCV, 2016.
[63] X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning
for visual relationship and attribute detection. CVPR, 2017.
[64] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and
D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
[65] T. Lin, X. Zhao, and Z. Shou. Single shot temporal action detection. In ACM
Multimedia, 2017.
[66] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[67] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear cnn models for fine-grained visual recognition. In ICCV, 2015.
[68] X. Lin and D. Parikh. Leveraging visual question answering for image-caption
ranking. In ECCV, 2016.
[69] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg.
SSD: Single shot multibox detector. In ECCV, 2016.
[70] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. arXiv preprint arXiv:1411.4038, 2014.
[71] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention
for visual question answering. NIPS, 2016.
[72] C. Lynch, K. Aryafar, and J. Attenberg. Images don't lie: Transferring deep vi-
sual semantic features to large-scale multimodal learning to rank. arXiv preprint
arXiv:1511.06746, 2015.
[73] L. Ma, Z. Lu, L. Shang, and H. Li. Multimodal convolutional neural networks for
matching image and sentence. In ICCV, 2015.
[74] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[75] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based
approach to answering questions about images. arXiv preprint arXiv:1505.01121,
2015.
[76] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. Mc-
Closky. The stanford corenlp natural language processing toolkit. In ACL (System
Demonstrations), 2014.
[77] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal
recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632, 2014.
[78] B. McFee and G. R. Lanckriet. Metric learning to rank. In ICML, 2010.
[79] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
[80] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS,
2014.
[81] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control
through deep reinforcement learning. Nature, 2015.
[82] Y. K. Muthusamy, R. A. Cole, and M. Slaney. Speaker-independent vowel recogni-
tion: Spectrograms versus cochleagrams. In ICASSP, 1990.
[83] G. K. Myers, R. Nallapati, J. van Hout, S. Pancoast, R. Nevatia, C. Sun, K. Chen,
A. Habibian, D. C. Koelma, K. E. van de Sande, A. W. Smeulders, et al. The
2013 sesame multimedia event detection and recounting system. In Proceedings of
TRECVID, 2014.
[84] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects
for referring expression understanding. In ECCV, 2016.
[85] H. Nakayama, T. Harada, and Y. Kuniyoshi. Evaluation of dimensionality reduction
methods for image auto-annotation. In BMVC, 2010.
[86] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep
learning. In ICML, 2011.
[87] Q. Ning, K. Chen, L. Yi, C. Fan, Y. Lu, and J. Wen. Image super-resolution via
analysis sparse prior. IEEE Signal Processing Letters, 2013.
[88] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman.
Visually indicated sounds. In CVPR, 2016.
[89] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient
sound provides supervision for visual learning. In ECCV, 2016.
[90] D. H. Park, D. Yang, A. Fukui, A. Rohrbach, T. Darrell, and M. Rohrbach. Multi-
modal compact bilinear pooling for visual question answering and visual grounding.
In EMNLP, 2016.
[91] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature
maps. In ACM SIGKDD, 2013.
[92] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and
S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for
richer image-to-sentence models. In ICCV, 2015.
[93] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and
S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for
richer image-to-sentence models. IJCV, 2016.
[94] F. Radenović, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV, 2016.
[95] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[96] M. Ren, R. Kiros, and R. Zemel. Exploring models and data for image question
answering. In arXiv:1505.02074. 2015.
[97] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object
detection with region proposal networks. In NIPS, 2015.
[98] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al.
Okapi at trec-3. NIST SPECIAL PUBLICATION SP, 1995.
[99] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual
phrases in images by reconstruction. In ECCV, 2016.
[100] R. Rosipal and N. Krämer. Overview and recent advances in partial least squares. In Subspace, latent structure and feature selection. Springer, 2006.
[101] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. Cdc: Convolutional-
de-convolutional networks for precise temporal action localization in untrimmed
videos. In CVPR, 2017.
[102] Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S.-F. Chang. Autoloc: Weakly-
supervised temporal action localization in untrimmed videos. In ECCV, 2018.
[103] Z. Shou, J. Pan, J. Chan, K. Miyazawa, H. Mansour, A. Vetro, X. Giro-i Nieto,
and S.-F. Chang. Online detection of action start in untrimmed, streaming videos.
In ECCV, 2018.
[104] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[105] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. CoRR, 2014.
[106] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[107] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press
Cambridge, 1998.
[108] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient
methods for reinforcement learning with function approximation. In NIPS, 1999.
[109] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
[110] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko.
Translating videos to natural language using deep recurrent neural networks. arXiv
preprint arXiv:1412.4729, 2014.
[111] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image
caption generator. arXiv preprint arXiv:1411.4555, 2014.
[112] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
[113] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching
for phrase localization. In ECCV, 2016.
[114] P. Welch. The use of fast fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on audio and electroacoustics, 1967.
[115] R. J. Williams. Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine learning, 1992.
[116] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In ACL, 1994.
[117] F. Xiao, L. Sigal, and Y. J. Lee. Weakly-supervised visual grounding of phrases
with linguistic structures. CVPR, 2017.
[118] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and
textual question answering. ICML, 2016.
[119] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio.
Show, attend and tell: Neural image caption generation with visual attention. arXiv
preprint arXiv:1502.03044, 2015.
[120] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for
image question answering. CVPR, 2016.
[121] T. Yao, T. Mei, and C.-W. Ngo. Learning query and image similarities with ranking
canonical correlation analysis. In ICCV, 2015.
[122] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in
referring expressions. In ECCV, 2016.
[123] L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model
for referring expressions. CVPR, 2017.
[124] M. D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint
arXiv:1212.5701, 2012.
[125] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding
network for visual relation detection. CVPR, 2017.
[126] J. Zhang, K. Chen, A. G. Schwing, and R. Urtasun. Estimating the 3d layout of indoor scenes and its clutter from depth sensors. In ICCV, 2013.
[127] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing
for multi-label image retrieval. In CVPR, 2015.
[128] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg. Visual to sound: Generating
natural sound for videos in the wild. CoRR, 2017.
[129] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
[130] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation
of sentences. In ICCV, 2013.
Abstract
Multimodal reasoning focuses on learning the correlation between different modalities presented in multimedia samples. It is an important task with many applications in our daily lives, e.g., autonomous driving, robotic question answering and image retrieval engines. It is also a challenging task that is closely related to Machine Learning, Natural Language Processing and other research areas in Computer Science. Typically, multimodal reasoning can be divided into three levels: i) the coarse level treats each modality as a uniform sample and focuses on learning inter-modal correlation; ii) the fine-grained level considers each modality's own characteristics and dives into fine-grained correlation learning; iii) the knowledge level leverages external knowledge to deal with more complex question-answering type reasoning.

This thesis describes my solutions to the three levels of multimodal reasoning. Most parts focus on the interaction between the natural language modality and the visual modality. The first part addresses the image retrieval problem, which lies in the coarse level. We introduce an attention mechanism to attend to useful information within each modality and weight each modality's importance, which boosts image retrieval performance.

In the second part, we address the phrase grounding problem, which is in the fine-grained level. We introduce a regression mechanism, reinforcement learning techniques and multimodal consistency step by step, transferring from the supervised learning scenario to the weakly supervised scenario. All of these techniques bring concrete improvements in phrase grounding performance.

In the third part, we explore the Visual Question Answering (VQA) problem in the knowledge level. Similar to human behavior in VQA, we introduce an attention mechanism to attend to useful regions conditioned on the input query's semantics, which filters out noise in the visual modality for answering questions.

Finally, we illustrate our recent efforts in reasoning over other modalities. We address the problem of generating sound waves from video content, where we consider fine-grained information and adopt a perceptual loss to further polish the generated sound waves. Besides, we present some interesting problems which we plan to address in the future.

This thesis demonstrates the effectiveness of all the algorithms on a range of publicly available video or image datasets.