MULTIMODAL IMAGE RETRIEVAL AND OBJECT CLASSIFICATION USING
DEEP LEARNING FEATURES
by
Shangwen Li
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2017
Copyright 2017 Shangwen Li
To my grandparents, my parents, and my wife.
Acknowledgments
Finally, the Ph.D. journey has come to an end. I would like to use this opportunity to show
my gratitude toward the people who helped me along the way.
I would first like to give my greatest appreciation to my Ph.D. advisor, Professor C.-
C. Jay Kuo. Not only did his deep academic knowledge and critical thinking benefit my
research greatly, but his wise suggestions and lifelong philosophy also lit me up like
a beacon through the darkness of my Ph.D. study. “To be mature, you need to be willing
to take responsibility.” Those words will never be forgotten.
I would also like to thank my wife, Shiyi Zhang, for her consideration, sacrifice, and
support. It is my great honor to have you for the rest of my life, and I truly appreciate your
company along the way.
I would like to thank my labmates, Xingze, Jasimine, Jing, Hang, Jiangyang, Martin,
Tsung-Jung, Kuan-Hsien, Xue, Hyunsuk, Harshad, Joe, Xiang, Sanjay, Brian, Sudeng,
Sachin, Chen, Jian, Xiaqing, Chun-ting, Hao, Yuzhuo for their suggestions and support.
I would like to thank the members of my qualifying and defense exam committees
for their time and knowledge. They are Prof. C.-C. Jay Kuo (Chair), Prof. Panayiotis
G. Georgiou, Prof. Aiichiro Nakano, Prof. B. Keith Jenkins, and Prof. Justin P. Haldar.
Finally, I would like to thank my parents and grandparents for their consistent love
and support throughout the years.
Contents
Dedication ii
Acknowledgments iii
List of Tables vii
List of Figures viii
Abstract xiii
1 Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Object Classification . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.1 Multimodal Image Retrieval . . . . . . . . . . . . . . . . . . . 12
1.2.2 Image Importance Prediction . . . . . . . . . . . . . . . . . . . 12
1.2.3 Object Classification . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . 17
2 Background Review 19
2.1 Deep Learning and Convolutional Neural Network . . . . . . . . . . . 19
2.2 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Structured Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Multimodal Image Retrieval with Object Tag Importance Prediction 38
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Overview of Proposed System . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Measuring Tag Importance . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Predicting Tag Importance . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Three Feature Types . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Tag Importance Prediction Model . . . . . . . . . . . . . . . . 48
3.5 Multimodal Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.1 CCA and KCCA . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.2 Retrieval Features . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.2 Retrieval Experiment Settings . . . . . . . . . . . . . . . . . . 55
3.6.3 Performance of Tag Importance Prediction . . . . . . . . . . . 57
3.6.4 Performance of Multimodal Image Retrieval . . . . . . . . . . 61
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Multimodal Image Retrieval with Scene Tag Importance Prediction 70
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Jointly Measuring Scene and Object Tag Importance . . . . . . . . . . 73
4.2.1 Issues for Measuring Scene Tag Importance . . . . . . . . . . . 73
4.2.2 Proposed Measuring Method . . . . . . . . . . . . . . . . . . . 78
4.3 Jointly Predicting Scene and Object Tag Importance . . . . . . . . . . . 79
4.3.1 Features for Predicting Scene Tag Importance . . . . . . . . . . 79
4.3.2 Joint Structured Model . . . . . . . . . . . . . . . . . . . . . . 84
4.3.3 Inference: a Linear Integer Programming Formulation . . . . . 85
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.2 Subjective Test Performance of Measured Tag Importance . . . 90
4.4.3 Performance of Tag Importance Prediction . . . . . . . . . . . 91
4.4.4 Performance of Multimodal Image Retrieval . . . . . . . . . . 93
4.5 More Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5 Improving Object Classification via Confusing Categories Study 106
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Proposed CCIR Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2.2 Confusing Categories Identification . . . . . . . . . . . . . . . 111
5.2.3 Confusing Categories Resolution . . . . . . . . . . . . . . . . 113
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 117
5.3.2 Overall Performance . . . . . . . . . . . . . . . . . . . . . . . 118
5.3.3 Evaluating Confusing Categories Identification . . . . . . . . . 121
5.3.4 Evaluating Confusing Categories Resolution . . . . . . . . . . 123
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6 Conclusion and Future Work 126
6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.2 Further Research Directions . . . . . . . . . . . . . . . . . . . . . . . 128
6.2.1 Considering more Semantic . . . . . . . . . . . . . . . . . . . 128
6.2.2 Other Forms of Retrieval . . . . . . . . . . . . . . . . . . . . . 129
6.2.3 Efficient Large Scale Image Retrieval . . . . . . . . . . . . . . 129
6.2.4 Extending the CCIR to more advanced network . . . . . . . . . 130
6.2.5 Improving Top-5 performance . . . . . . . . . . . . . . . . . . 130
6.2.6 Multi-level Confusion Hierarchy . . . . . . . . . . . . . . . . . 131
Bibliography 132
List of Tables
3.1 Performance comparison of tag importance prediction. . . . . . . . . . 57
4.1 Measured tag importance for Figure 4.2 based on proposed method,
where SGR stands for Scene tag Grammar Role . . . . . . . . . . . . . 79
4.2 Comparison of major image datasets. . . . . . . . . . . . . . . . . . . . 89
5.1 The overall error rates for the ImageNet validation data. . . . . . . . . . 119
5.2 A list of confusion sets that the proposed CCIR scheme has significant
top-1 error reduction on top of the VGG16 under the dense protocol. . . 120
5.3 A list of confusion sets that the proposed CCIR scheme has significant
top-5 error reduction on top of the VGG16 under the dense protocol. . . 120
5.4 Comparison of CCI methods under the dense protocol, where CM means
that the affinity matrix is generated based on the confusion matrix and
AV means the affinity matrix is generated based on the anchor vectors. . 122
5.5 Confusion hierarchy generalization performance under the dense proto-
col. The “AV” column indicates the anchor vectors used in generating
the hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.6 Performance of CCIR with (/BTS) and without (/woBTS) BTS Cluster-
ing under the dense protocol. . . . . . . . . . . . . . . . . . . . . . . . 124
5.7 Performance comparison of two classifiers for the mixed subset under
the dense protocol, where woRF denotes the probability-based classifier
and RF denotes the random forest classifier. . . . . . . . . . . . . . . . 125
List of Figures
1.1 Retrieval results of text query “Tommy Trojan” from the Google Image
Search engine, where all results relate to concept of “Tommy Trojan”
but have different visual content. . . . . . . . . . . . . . . . . . . . . . 2
1.2 Retrieval results of image “Tommy Trojan Statue” from the Google
Image Search engine, where all results are visually similar to the query. 2
1.3 A Multimodal Image Retrieval Framework applied to CBIR scenario. . 5
1.4 An example of unequally important object tags. Object instances in the
images are annotated using bounding boxes with different colors. . . . 6
1.5 Two images with the same scene tag “street” but with different scene
tag importance. While the left image focuses on the scene “street”, the
right image focuses on the objects. . . . . . . . . . . . . . . . . . . . . 7
1.6 Two exemplary images from the ImageNet dataset. Even though they
are very similar in visual appearance, they come from different classes
in the dataset. While the left one comes from the class “Eskimo dog,
husky”, the right one comes from the class “Siberian husky”. . . . . . . 10
1.7 Sample images within the class “miniature pinscher”. We can observe
great visual variety in terms of background, color, presence of other
main objects etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 An Example of Convolutional Neural Network . . . . . . . . . . . . . 22
2.2 AlexNet structure [53]. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 VGG16 structure. Source: https://www.cs.toronto.edu/~frossard/post/vgg16/ 25
3.1 An overview of the proposed MIR/TIP system. Given a query image
with important “dog” and “bicycle”, the MIR/TIP system will rank the
good retrieval examples with important “dog” and “bicycle” ahead of
bad retrieval ones with less important “dog” and “bicycle”. . . . . . . . 40
3.2 An example of object importance measure using human sentence descrip-
tions, where object tags “person” and “motorbike” appear in human sen-
tences in terms of their synonyms: “man” and “scooter”. . . . . . . . . 42
3.3 An example for comparison between probability importance and dis-
counted probability importance, where the “bicycle” in both images
are equally important with probability importance but not with the dis-
counted probability importance (left “bicycle”: 0.6; right “bicycle”:
1.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Three object importance cues where texts below images give the ground
truth tag importance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 The mean and the standard deviation of the ground truth tag importance
for different object categories in the UIUC dataset. . . . . . . . . . . . 46
3.6 Object pairs with their visual properties (size, location) and semantic
categories in the exemplary database image in Fig. 3.1, where “>”, “<”,
and “≈” denote higher, lower, and approximately equal relative importance
between two objects, respectively. . . . . . . . . . . . . . . . . . . . . . 48
3.7 The MRF model for the exemplary database image in Fig. 3.1. . . . . . 49
3.8 Comparison of continuous-valued tag importance prediction errors of
seven models: (a) the UIUC dataset, (b) the COCO dataset. . . . . . . . 60
3.9 Results of four importance prediction models on an image with “bus”,
“car”, and “person” tags. . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.10 NDCG curves for image-to-image retrieval. The dashed lines are upper
bounds for importance based MIR systems. . . . . . . . . . . . . . . . 62
3.11 Top two retrieved results for two exemplary images in the UIUC dataset,
where the four columns represent four retrieval methods: (A) Visual
only, (B) Traditional MIR, (C) MIR/PCTI, and (D) MIR/TCTI. . . . . . 63
3.12 Top four retrieved results for two exemplary images in the COCO dataset,
where the four columns represent four retrieval methods: (A) Visual
only, (B) Traditional MIR, (C) MIR/PCTI, and (D) MIR/TCTI. . . . . . 65
3.13 Top two retrieved results for two exemplary images in the UIUC dataset,
where the four columns represent four retrieval methods: (A) Visual
only, (B) Traditional MIR, (C) MIR/PCTI, and (D) MIR/TCTI. . . . . . 66
3.14 Top four retrieved results for two exemplary images in the COCO dataset,
where the four columns represent four retrieval methods: (A) Visual
only, (B) Traditional MIR, (C) MIR/PCTI, and (D) MIR/TCTI. . . . . . 67
3.15 The NDCG curves for the tag-to-image retrieval on the COCO dataset.
The dashed lines are upper bounds for importance based MIR systems. . 68
3.16 The NDCG curves for the image-to-tag retrieval on the COCO dataset.
The dashed lines are upper bounds for importance based MIR systems. . 69
4.1 Four exemplar images with object tags “car”, “person”, and “surfboard”.
(a) A “beach” scene image with unimportant object tags. (b) An image
that is not a “beach” scene. (c) A “beach” image with important “car” and
“beach”. (d) A “beach” image with important objects. . . . . . . . . . . 71
4.2 Problem of treating scene tag as a special object tag and applying dis-
counted probability to measure the tag importance. An iconic “beach”
scene image with its five sentence descriptions and resulted tag impor-
tance value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 An example image with two associated sentence descriptions. In sentence
(1), the scene tag “beach” appears as the main noun while “surfboard”
appears in the prepositional phrase that modifies “beach”. In sentence
(2), the scene tag “beach” appears in the verb phrase that modifies the
“surfboard”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 The sentence constituent trees of the two sentences in Figure 4.3. The
acronym in the trees are: S (Sentence), NP (Noun Phrase), VP (Verb
Phrase), PP (Preposition Phrase), DT (Determiner), JJ (Adjective), NN
(Singular Noun), NNS (Plural Noun), VBN (Verb, past participle), VBP
(Verb, non-3rd person singular present), IN (Preposition or subordinat-
ing conjunction). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 The mean and the standard deviation of the ground truth tag importance
for 6 scene categories in the COCO dataset. . . . . . . . . . . . . . . . 80
4.6 Three cues for predicting scene tag importance. The ground truth importance
values are shown below each image. . . . . . . . . . . . . . . . . . . . . 82
4.7 A sample image and its corresponding joint MRF model. . . . . . . . . 84
4.8 The image number of each scene category in the COCO Scene dataset. . 89
4.9 The image number of each object category in the COCO Scene dataset. 90
4.10 Comparison of continuous-valued tag importance prediction errors of
seven models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.11 The NDCG curves for the image-to-image retrieval on the COCO Scene
dataset. The dashed lines are upper bounds for importance based MIR
systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.12 Top three I2I retrieved results for two exemplary queries, where the four
columns show four retrieval systems: (A) Visual Baseline, (B) Tradi-
tional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . . . . . . . . . . . 95
4.13 The NDCG curves for the tag-to-image retrieval on the COCO Scene
dataset. The dashed lines are upper bounds for importance based MIR
systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.14 Tag-to-Image retrieval results for two exemplary queries with different
focus, where the three columns correspond to the top three ranked tags
of three MIR systems: (A)Traditional MIR, (B) MIR/PBTI, and (C)
MIR/PCTI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.15 The NDCG curves for auto ranked tag list generation on the COCO
Scene dataset. The dashed lines are upper bounds for importance based
MIR systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.16 Tagging results for two exemplary images, where the four columns cor-
respond to the top three ranked tags of four MIR systems: (A) Baseline,
(B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . . . . . 99
4.17 I2I Query 1. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 99
4.18 I2I Query 2. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 100
4.19 I2I Query 3. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 100
4.20 I2I Query 4. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 101
4.21 I2I Query 5. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 101
4.22 T2I Query 1. The three columns show three retrieval systems: (A) Tra-
ditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI. . . . . . . . . . . . . 102
4.23 T2I Query 2. The three columns show three retrieval systems: (A) Tra-
ditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI. . . . . . . . . . . . . 102
4.24 T2I Query 3. The three columns show three retrieval systems: (A) Tra-
ditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI. . . . . . . . . . . . . 103
4.25 T2I Query 4. The three columns show three retrieval systems: (A) Tra-
ditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI. . . . . . . . . . . . . 103
4.26 Tagging result 1. The four columns show four retrieval systems: (A)
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 104
4.27 Tagging result 2. The four columns show four retrieval systems: (A)
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 104
4.28 Tagging result 3. The four columns show four retrieval systems: (A)
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 104
4.29 Tagging result 4. The four columns show four retrieval systems: (A)
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 104
4.30 Tagging result 5. The four columns show four retrieval systems: (A)
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI. . . . 105
5.1 An illustration of subsets within a confusion set that contains multiple
object categories (e.g. thunder snake, king snake, etc.) as shown in the
bottom-left part of the figure. The top row: two groups of images under
the same “horned rattlesnake” category. The bottom row: two subsets
in a confusion set that contains the “horned rattlesnake” category. The
two subsets of images encircled by the green box are visually similar,
thus leading to the confusion set. . . . . . . . . . . . . . . . . . . . . . 107
5.2 An overview of the proposed CCIR system. It consists of 3 modules: 1)
a baseline CNN, 2) a Confusing Categories Identification (CCI) module,
and 3) a Confusing Categories Resolution (CCR) module. . . . . . . . . 109
5.3 Illustration of projections of image feature onto various anchor vectors.
The input image comes from the category “horned rattlesnake”, and its
feature (red line) projections onto the anchor vectors of “snake” related
categories have similar values. . . . . . . . . . . . . . . . . . . . . . . 113
5.4 A subgraph of the constructed confusion graph, where each node represents
an object category, and the width of the edges between nodes indicates
the degree of confusion. . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.5 Illustration of trees for two confusion sets obtained by the BTS cluster-
ing. They include 3 and 2 object categories, respectively. Pure subsets
are encircled by dashed-line bounding boxes. . . . . . . . . . . . . . . 116
5.6 Case studies on the ImageNet dataset, where each row represents a test-
ing case. Column (a): the test image with ground truth label. Column
(b): top 5 guesses from the VGG16 under the dense protocol. Column
(c): top 5 guesses from the VGG16+CCIR under the dense protocol. . . 121
6.1 An example where the object relation plays an important role in achiev-
ing good retrieval performance. While the query and candidate 2 images
have the same object relation representing action “person riding horse”,
the relation between “person” and “horse” in candidate 1 image is sim-
ply geometrical “next to”. . . . . . . . . . . . . . . . . . . . . . . . . . 129
Abstract
Computer vision has achieved a major breakthrough in recent years with the advance-
ment of deep learning based methods. However, its performance still cannot be claimed
robust enough for practical applications, and more advanced methods on top of deep learning
architectures are needed. This work targets using deep learning features to tackle two
major computer vision problems: Multimodal Image Retrieval and Object Classifica-
tion.
Multimodal Image Retrieval (MIR) aims at building an alignment between the
visual and textual modalities, thus reducing the well-known “semantic gap” in the image
retrieval problem. As the most widely available textual information for images, tags play
an important semantic role in the MIR framework. However, treating all tags in an image as
equally important may result in misalignment between the visual and textual domains, lead-
ing to poor retrieval performance. To address this problem and build a robust retrieval
system, we propose an MIR framework that embeds tag importance as the textual fea-
ture. In the first part, we propose an MIR system, called Multimodal Image Retrieval
with Tag Importance Prediction (MIR/TIP), to embed the automatically predicted object
tag importance in image retrieval. To achieve this goal, a discounted probability metric
is first presented to measure the object tag importance from human sentence descrip-
tions. Using this as ground truth, a structured object tag importance prediction model is
proposed. The proposed model integrates visual, semantic, and context cues to achieve
robust object tag importance prediction performance. Our experimental results demon-
strate that, by embedding the predicted object tag importance, significant performance
gain can be obtained in terms of both objective and subjective evaluation. In the second
part, the MIR/TIP system is extended to account for “scene”, which is another important
aspect of an image. To jointly measure the scene and object tag importance, the discounted
probability metric is modified to consider the grammatical role of the scene tag in the
human annotated sentence. The structured model is modified to predict the scene and
object tag importance at the same time. Our experimental results demonstrate that the
robustness of the MIR system is greatly enhanced by our predicted scene and object tag
importance.
Object classification is a long-standing problem in the computer vision field, which
serves as the foundation for other problems such as object detection, scene classification,
and image annotation. As the number of object categories continues to increase, it is
inevitable to have certain categories that are more confusing than others due to the prox-
imity of their samples in the feature space. In the third part, we conduct a detailed analysis
on confusing categories and propose a confusing categories identification and resolution
(CCIR) scheme, which can be applied to any CNN-based object classification baseline
method to further improve its performance. In the CCIR scheme, we first present a
procedure to cluster confusing object categories together to form a confusion set auto-
matically. Then, a binary-tree-structured (BTS) clustering method is adopted to split
a confusion set into multiple subsets. A classifier is subsequently learned within each
subset to enhance its performance. Experimental results on the ImageNet ILSVRC2012
dataset show that the proposed CCIR scheme can offer a significant performance gain
over the AlexNet and the VGG16.
Chapter 1
Introduction
1.1 Significance of the Research
1.1.1 Image Retrieval
Image retrieval [82] is a long-standing problem in the computer vision and the informa-
tion retrieval research fields. Despite a tremendous amount of progress being made in
text search, the performance of image retrieval is still far from satisfactory. The existing
commercial image search engines, such as Bing and Google Image Search, rely heavily
on text data. That is, they attempt to match user-input keywords (tags) to the texts accompa-
nying an image, and the approach is thus called Tag Based Image Retrieval (TBIR). This
methodology has an obvious drawback: the visual component of the image is ignored.
Figure 1.1 shows an example of a TBIR search by inputting “Tommy Trojan” into the Google
Image Search engine. It can be observed that all the returned results are related to the
concept “Tommy Trojan”, but their visual contents are very different. It is worth men-
tioning that, behind the scenes, no image analysis or understanding is involved. What
the engine performs is simple keyword matching, as in text retrieval, because all the returned
results have already been manually annotated with the keywords “Tommy Trojan”.
Figure 1.1: Retrieval results of the text query “Tommy Trojan” from the Google Image
Search engine (“Tommy Trojan Statue”, “Tommy Trojan Mascot”, “Tommy Trojan Logo”),
where all results relate to the concept of “Tommy Trojan” but have different visual content.
Figure 1.2: Retrieval results of the image query “Tommy Trojan Statue” from the Google
Image Search engine, where all results are visually similar to the query.
On the other hand, the content-based image retrieval (CBIR) technique [15, 16, 91]
aims at searching for visually similar images based on extracted low-level features. Typi-
cally, an image is input as a query into the search engine, and visually similar images are
returned as retrieval results. Figure 1.2 shows an example of inputting a “Tommy Trojan
Statue” image into the Google CBIR system, where the top-ranked retrieval results are
shown on the right. It can be observed that the retrieved results are “visually similar”
rather than “semantically similar” as in Figure 1.1. Behind the scenes, numerical
features describing the visual properties of both the query and the database images are
extracted. This is done to reduce the so-called “sensory gap” between objects in the
real world and their numerical descriptors. During retrieval, the database images are
ranked according to the numerical distances between their visual descriptors and that of
the query image. Even though CBIR is an interesting technique that has been studied since
the 1990s, its performance is still far from satisfactory due to the well-known “semantic
gap”, which means that it is very hard for a computer to understand the semantic concept
behind a numerical visual descriptor. Although recent deep learning features have alleviated
this problem to some extent, they are still not robust enough to understand the high-level
semantic concepts within images. For example, even though the retrieved images in
Figure 1.2 are similar to the query, the search engine did not figure out that the image
actually relates to the concept of “Tommy Trojan”, and thus neither the “Tommy Trojan
Mascot” nor the “Tommy Trojan Logo” images appear in the retrieved results due to their
visual differences.
From the above discussion, we know that “semantically similar” and “visually sim-
ilar” are two different desired properties of the retrieved images. Which property the
users desire really depends on their search intents and goes beyond the main goal of this
dissertation. In this work, we will focus on reducing the “semantic gap”, a long-standing
and hard problem in image retrieval.
While the TBIR approach can easily achieve “semantically similar” results through text
matching, one cannot assume that all images on the web are well annotated with text
information. Furthermore, for the CBIR approach, the query image does not have associated
semantic information, making the “semantic gap” a major obstacle even if the database
images have well-annotated text.
One way to tackle the above problems is to leverage well-annotated online images to
help reduce the “semantic gap”. Thanks to the explosion of information on the internet, a
considerable number of images are well annotated with textual information such as tags,
labels, sentences, or even paragraphs. For these images, the text provides meaningful
semantic information while the image provides detailed visual content. In other words, the
complementary nature of texts and images can provide a more complete description of the
underlying content. Thus, it is intuitive to combine the image and text modalities to boost
image retrieval performance, and this approach is usually called Multimodal Image
Retrieval (MIR).
MIR works by building a bridge between the semantic and visual domains to reduce the
semantic gap. Figure 1.3 illustrates the MIR framework in the CBIR scenario. Suppose
we have a sufficient number of database images with textual information (tags) available.
The key idea of MIR is to build a link between the visual and textual domains of
images using machine learning algorithms such as Kernel Canonical Correlation Analysis
(KCCA) [12, 32, 37, 42, 43, 80] or, more recently, deep learning [1, 75, 111] techniques.
Once the link between the visual and textual domains is built, the visual similarity between
the query image and the database images can indicate semantic similarity, thus reducing
the semantic gap to some extent. Details of multimodal machine learning algorithms will
be introduced in Chapter 2 as background knowledge.
Another advantage of the MIR framework is that it can accommodate different
retrieval tasks at the same time. Since the relation between the visual and textual domains
has been built using either KCCA or deep learning, so-called cross-modality information
search can be achieved by leveraging this relation. Typically, an MIR framework
can support three retrieval-related tasks:
1. Image-to-Image retrieval (I2I): in this case, the goal is to retrieve images from
the database that are similar to the query image. This is very similar to CBIR except that
some database images also have textual information.
2. Text-to-Image retrieval (T2I): in this case, the goal is to retrieve images from
the database that match the input textual query. This is very similar to TBIR except
that it also considers visual similarity in addition to semantic similarity.
3. Image-to-Text (I2T): this is also called automatic image annotation [4, 22, 45].
The goal of this task is to automatically label an image with tags that describe
its semantic concepts.
Figure 1.3: A Multimodal Image Retrieval framework applied to the CBIR scenario.
Figure 1.4: An example of unequally important object tags. Object instances in the
images are annotated using bounding boxes with different colors.
Despite the popularity of the MIR framework in the image retrieval research field, there
is an obvious problem that limits its efficacy. Consider the query and its retrieval results
shown in Figure 1.4. It is clear that both the query and the retrieved images have “motorcycle”,
“car”, and “person” as their object tags. Consequently, the retrieval results are considered
relevant because the retrieved images have the same tags as the query.
However, the three retrieved images obviously are not of equal quality to a human,
even though they have the same tags. In particular:
In the query image, the major focus is the “motorcycle”. The “person” and the “car”
are unimportant background objects.
The first retrieved result focuses on the “person sitting on the car” rather than
the “motorcycle” in the background. Thus it is a bad retrieval result.
In the second retrieved result, since the “person” is riding the “motorcycle”, the person
becomes a non-negligible object in the image besides the “motorcycle”. This image is
acceptable as a retrieval result.
The third retrieved result is the best among the three since it is an image with an
important “motorcycle” and an unimportant background “person” and “car”.
From the above discussion, we can conclude that tags in an image may have different
importance, and ignoring such “tag importance” in the MIR framework, or even in a traditional
image retrieval framework, may result in poor system performance. The example in Fig-
ure 1.4 uses object tags to demonstrate the necessity of “tag importance”. However, tag
importance is not only applicable to object tags; it can also extend to richer textual
information such as scene, action, geometric location, and attributes. Figure 1.5 shows
two images of the same scene “street”. However, we can observe that the image on the
left focuses on the overall structure of the scene “street”, while the right image focuses on
the objects. Hence these two images cannot be considered either visually similar or
semantically similar.
Figure 1.5: Two images with the same scene tag “street” but with different scene tag
importance (left: “Street with unimportant objects”; right: “Important object nearby the
street”). While the left image focuses on the scene “street”, the right image focuses
on the objects.
To summarize, although the MIR framework has achieved significant progress over
the past few years, its wide application is still limited by the aforementioned “tag impor-
tance” problem. Moreover, the complicated nature of image content and the interactions
among objects, scenes, object relations, and attributes make image tag importance prediction a
challenging problem.
The first and second parts of this work aim at automatically predicting the image tag
importance and embedding it into the MIR framework. To achieve this, it is very impor-
tant to leverage the complementary nature of the textual and visual domains. Specifically,
while tags can help reduce the semantic gap, the visual domain provides significant
cues for predicting the importance information.
In this work, we first study how object tag importance can help make the MIR
system robust. In particular, we first define a discounted probability metric to measure the
ground truth object tag importance from human-provided sentence descriptions. Then,
we propose a novel structured prediction model to automatically predict the object tag
importance from the visual content of images. Finally, CCA is adopted to incorporate
the predicted object tag importance into the MIR framework. Experimental results show sig-
nificant improvements of the proposed MIR with Tag Importance Prediction (MIR/TIP)
system over traditional MIR systems that ignore tag importance.
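As a small illustration of the last step, the predicted per-tag importance values can be arranged into a fixed-length vector over the tag vocabulary and used as the textual view in CCA learning, in place of the binary tag-occurrence vector of a traditional MIR system. The vocabulary and importance values below are hypothetical.

# Illustrative only: build an importance-weighted tag vector for one image.
# The tag vocabulary and the predicted importance values are hypothetical.
import numpy as np

TAG_VOCAB = ["person", "car", "motorcycle", "dog", "bicycle"]  # assumed vocabulary

def importance_vector(predicted_importance, vocab=TAG_VOCAB):
    """Map {tag: importance} to a dense vector aligned with the vocabulary.
    A traditional MIR system would instead place a 1 for every present tag."""
    vec = np.zeros(len(vocab))
    for tag, score in predicted_importance.items():
        if tag in vocab:
            vec[vocab.index(tag)] = score
    return vec

# Example: the motorcycle dominates the image; person and car are background.
print(importance_vector({"motorcycle": 1.0, "person": 0.2, "car": 0.1}))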
As discussed before, considering only the object tag importance in image retrieval is
not enough to handle the diversity of image content. The scene, as a major part of image content,
cannot be ignored in the MIR framework. To make our MIR/TIP system more robust,
we extend the importance prediction model to consider the scene tag in addition to object tags.
More specifically, we first develop a method to extract scene importance from the parse
tree of a sentence description. Then we modify our model to jointly predict the scene
and object tag importance. Our experimental results demonstrate that both scene and
object tag importance are necessary to handle the diversity of real-world images, and that
our predicted tag importance can greatly boost the performance of the traditional MIR
system.
In short, image retrieval is a significant yet challenging problem in the multimedia and
computer vision research fields due to the well-known “semantic gap”. Multimodal
Image Retrieval helps to reduce the “semantic gap”, but fails to consider the importance
information in an image. To address this problem, we develop an MIR/TIP system to
enhance the robustness of the state-of-the-art MIR framework.
1.1.2 Object Classification
Object classification is a long-standing problem in the computer vision field, which
serves as the foundation for other problems such as object detection [30, 31, 38, 81],
scene classification [62, 108], and image annotation [61]. In recent years, the deep
Convolutional Neural Network (CNN) has achieved a significant performance gain over
traditional methods due to the availability of large-scale training data and better opti-
mization procedures. However, its performance is still not as robust as that of humans, and its
practical usage in highly demanding tasks remains to be explored.
As the number of object categories continues to increase, it is inevitable to have cer-
tain categories that are more confusing than others due to the proximity of their samples
in the feature space. One exemplar dataset is ImageNet [19], which provides
images of a large number of object categories organized according to the WordNet hierarchy [72].
It offers a common platform for researchers to analyze and interact with large-scale
visual data. Moreover, ImageNet has held the ILSVRC (ImageNet Large Scale Visual
Recognition Challenge) annually since 2010. It includes different tasks aimed at dif-
ferent recognition problems. Among them, the classification task of ILSVRC2012 [83]
has become the standard benchmark for object classification. Its goal is to classify images
into one of 1000 object categories with the highest accuracy. While the 1000 object
categories cover a wide range of varieties, it is inevitable that some categories are much
harder to distinguish than others. This phenomenon is called “inter-class simi-
larity”, and it is a major obstacle to achieving good classification performance on a large-
scale dataset. Figure 1.6 gives an example of “inter-class similarity”: it shows two
dogs coming from different classes in the ImageNet dataset. They are so similar that it is
hard to distinguish them even for humans. Consequently, it is challenging for a computer
to learn such subtle distinctions among confusing categories, and special treatment needs
to be applied to this kind of hard sample.
While the “inter-class similarity” has already posed significant difficulty for the
object classification problem, the innate visual versatility within each class makes the
problem even more challenging. This is called the “intra-class variety”, which is a noto-
rious hurdle for all kinds of computer vision problems. Some exemplary images of the
class “miniature pinscher” in the ImageNet dataset are shown in Figure 1.7. Clearly,
significant visual differences exist among these images, and such variety will prevent
the deep learning method from capturing the representative features of this category. There-
fore, how to efficiently handle the “intra-class variety” as well as the “inter-class similarity”
plays a key role in building a high-performance object classification system.
Figure 1.6: Two exemplary images from the ImageNet dataset. Even though they are
very similar in visual appearance, they come from different classes in the dataset. While
the left one comes from the class “Eskimo dog, husky”, the right one comes from the
class “Siberian husky”.
Figure 1.7: Sample images within the class “miniature pinscher”. We can observe great
visual variety in terms of background, color, presence of other main objects, etc.
To address the above two challenges, we propose to enhance the current deep learn-
ing based object classification system from the perspective of confusion analysis in the
third part of this work. The general idea is to first group the confusing categories into
confusion sets (for “inter-class similarity”), and then optimize the classification performance
of each confusion set by dividing its images into subsets according to their visual
similarity (for “intra-class variety”). This can be decomposed into two consecutive
problems: 1) how to identify and group confusing object categories automatically using
the properties of the CNN, and 2) how to boost the classification performance within a confusion
set. To solve the first problem, we conduct a detailed analysis on confusing categories to
identify the innate reason why a CNN confuses certain categories with others.
Rather than learning the hierarchical structure based on the CNN confusion matrix, we
propose a clustering approach to automatically group confusing object categories into
confusion sets specific to a pre-trained network. To handle the second issue, we adopt
a binary-tree-structured (BTS) clustering method to split a confusion set into multiple
subsets. A classifier is subsequently learned within each subset to capture its unique dis-
criminative features and enhance its classification performance. The above two proce-
dures form the core of the proposed Confusing Categories Identification and Resolution
(CCIR) scheme. Experimental results on the ImageNet ILSVRC2012 dataset show that
the proposed CCIR scheme can offer a significant performance gain over the AlexNet
and the VGG16.
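As a rough sketch of the identification step, confusing categories can be grouped by measuring how similar their class “anchor vectors” are and clustering the resulting affinity graph. Here the anchor vectors are taken to be the per-class weight vectors of the network’s last layer, and cosine affinity with spectral clustering is used; both choices are illustrative assumptions rather than the exact procedure developed in Chapter 5.

# Illustrative sketch: group confusable classes via anchor-vector affinity.
# Assumptions: `anchor_vectors` stands in for the per-class weight vectors of
# a pre-trained CNN's last layer; the clustering method is hypothetical.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
num_classes, feat_dim = 20, 128
anchor_vectors = rng.normal(size=(num_classes, feat_dim))  # stand-in weights

# Cosine similarity between anchor vectors: classes whose anchors point in
# similar directions produce similar logits and are therefore easily confused.
normed = anchor_vectors / np.linalg.norm(anchor_vectors, axis=1, keepdims=True)
affinity = np.clip(normed @ normed.T, 0.0, None)  # non-negative affinity matrix

# Partition the affinity graph; each cluster acts as one confusion set.
labels = SpectralClustering(n_clusters=5, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
confusion_sets = {c: np.where(labels == c)[0].tolist() for c in range(5)}
print(confusion_sets)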
To conclude, the performance of current deep learning based object classification sys-
tems is inhibited by the “inter-class similarity” and “intra-class variety” problems. In
response to these two issues, we propose a confusion-analysis-based system to refine
the CNN predictions, which achieves state-of-the-art performance on the Ima-
geNet ILSVRC2012 dataset.
1.2 Related Work
In this section, we review recent studies on MIR systems, image importance pre-
diction, and object classification that are closely related to our work.
1.2.1 Multimodal Image Retrieval
The current state-of-the-art MIR systems aim at finding a shared latent subspace
between image visual and textual features so that the information in different domains
can be represented in a unified subspace. Several learning methods have been developed
for this purpose, among which the Canonical Correlation Analysis (CCA) and its exten-
sion known as the Kernel CCA (KCCA) are the most relevant to our work. The main
idea of CCA is to find a subspace for visual and textual features so that their projections
into this lower-dimensional representation are maximally correlated. Hardoon et al. [37]
adopted KCCA to retrieve images based on their content using text query. Rasiwasia et
al. [80] replaced the textual modality with an abstract semantic feature space in KCCA
training. More recently, Gong et al. [32] proposed a three-view KCCA that learns the
subspace for visual, tag and semantic features jointly. Hwang and Grauman [42, 43]
adopted human provided ranked tag lists as object tag importance and used them in
KCCA learning, which is most relevant to our work.
1.2.2 Image Importance Prediction
Importance in images is a relatively new concept that has gained attention in visual
research recently. Elazary and Itti [24] considered the order of naming as an inter-
estingness indicator of objects in images, and used saliency to predict their locations.
A formal study of object importance was conducted by Spain and Perona [92], where
a forgetful urn model was developed to measure object importance from ordered tag
lists and then low-level visual cues were used to predict the object importance value.
Berg et al. [7] used human sentence descriptions to measure the importance of objects,
scenes and attributes in images, and proposed various visual and semantic cues for pre-
diction. Parikh and Grauman [77] proposed a ranking method for the same attribute
across different images to capture its relative importance in multiple images. Instead of
predicting tag importance for images directly, some work focuses on image tags rerank-
ing to achieve better retrieval performance. For example, Liu et al. [67] developed an
unsupervised learning approach to rerank tags according to their relevance to image con-
tent. Lan and Mori [58] proposed a Max-Margin Riffled Independence Model to rerank
object and attribute tags in an image in the order of decreasing relevance or importance.
1.2.3 Object Classification
Various work has been proposed to exploit the semantic confusion hierarchy of object
categories for classification performance improvement. Earlier work focused on the
use of a pre-defined hierarchy [44, 46, 70, 100] to enhance the performance of conven-
tional shallow classifiers. Zweig and Weinshall [117] presented an ensemble method
by combining models targeting at object categories in different hierarchical levels. A
WordNet-based semantic distance was proposed by Fergus et al. [29] to share labels
between different categories. Zhao et al. [114] utilized the object hierarchy to define
a loss function and select a common set of features for related categories. Later work
attempted to generate the hierarchy adaptively [3,21,35,63,71,86,90]. To automatically
learn the hierarchy from the data, Liu et al. [66] proposed to model it as a probabilis-
tic label tree, in which the final prediction is made by combining predictions from leaf
nodes.
In recent years, with the excellent performance of the deep learning classifier
[39, 53, 89, 95, 104], attention has been shifted to the usage of semantic hierarchy for
CNN performance enhancement. A tree-based priors transfer learning approach was
proposed in [93] to be integrated with CNN to improve the classification performance
for classes without sufficient training data. Deng et al. [18] relabeled images with class
hypernym and replaced the softmax layer in a CNN with a fully connected (FC) layer
followed by a hierarchy-and-exclusion graphical model. Xiao et al. [106] proposed
a CNN-based incremental learning approach that leveraged the category hierarchy to
optimize the model when training data of new classes are available. More recently, Yan
et al. [107] proposed a hierarchical deep CNN that separated easy classes using a coarse
category classifier while distinguishing difficult classes using fine category classifiers.
This solution is more scalable than others.
1.3 Contributions of the Research
In the first part of this research, we propose a robust Multimodal Image Retrieval system
which can automatically predict the image tag importance. The goal of our retrieval
system is to retrieve images that can preserve the important content within the query
and rank the results in decreasing order of desired content’s importance.
In Chapter 3, we focus on embedding the automatically predicted object tag impor-
tance into the retrieval framework. Its contributions include the following points:
We define a discounted probability approach to measure the ground truth object
tag importance from sentence descriptions. The proposed metric takes into
consideration the different human behaviors in describing image content. It
saves us from the laborious human work of manually ranking the tag list in decreasing
order of importance.
We identify three types of cues for predicting the object tag importance, namely visual,
semantic, and context cues. The visual cue captures how humans perceive an
object tag based on its size and location. The semantic cue captures how important
humans consider an object tag a priori. The context cue models how the impor-
tance values of the object tags in a specific image interact with one another. Numerical features
are extracted for these three cues and used together to boost the
tag importance prediction performance.
We propose a structured prediction model to accommodate the interdependent
nature of object tag importance in a given image. More specifically, we use a
Markov Random Field (MRF) to model the probability distribution of object tag
importance in a given image. The structured prediction model can jointly take
the three cues into consideration and trade off among them, thus achieving
better prediction performance.
We adopt an empirical learning approach, the Structured Support Vector Machine
(SSVM), to learn the structured model parameters. To accommodate the ordinal
nature of the importance labels, we use the Mean Absolute Difference as the loss function
in the training process, which is critical for fitting a good importance prediction model.
To embed the importance information into image retrieval, we use the predicted tag
importance vector as the textual feature during CCA semantic subspace learning. We
compare different MIR systems and show promising results for our approach.
We propose a new relevance function for the Normalized Discounted Cumulative Gain
(NDCG) used to evaluate our system. The relevance function is based on our ground truth tag
importance, as sketched below.
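For reference, a minimal sketch of how such an importance-based NDCG could be computed follows. The relevance function used here (a dot product of ground-truth importance values over shared tags) is only a hypothetical stand-in for the one defined in Chapter 3; the NDCG formula itself is standard.

# Illustrative sketch of importance-based NDCG. The relevance function below
# is a hypothetical stand-in for the relevance function proposed in Chapter 3.
import numpy as np

def relevance(query_importance, result_importance):
    """Higher when the result emphasizes the same tags the query emphasizes."""
    tags = set(query_importance) & set(result_importance)
    return sum(query_importance[t] * result_importance[t] for t in tags)

def ndcg_at_k(query_importance, ranked_results, k):
    rels = [relevance(query_importance, r) for r in ranked_results[:k]]
    dcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(rels))
    ideal = sorted((relevance(query_importance, r) for r in ranked_results),
                   reverse=True)[:k]
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

query = {"motorcycle": 1.0, "person": 0.2}
results = [{"person": 1.0}, {"motorcycle": 0.9, "car": 0.1}, {"motorcycle": 1.0}]
print(ndcg_at_k(query, results, k=3))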
In Chapter 4, we focus on extending our retrieval system to account for scene tag
importance. Its contributions include the following points:
We propose a scene tag importance measurement approach based on constituency
parsing in natural language processing. This measurement considers not only
the appearance of the scene tag in sentences, but also its grammatical role. The scene
tag importance is combined with the object tag importance to achieve
better performance in real-world image retrieval applications.
We identify unique semantic and visual cues for predicting the scene tag impor-
tance. As with object tags, the semantic cue captures how important humans consider
a scene a priori. Unlike object tags, whose bounding boxes are critical in modeling
the visual cue, we need the image’s global visual properties to model the visual cue for
the scene tag.
We investigate the mutual influence between scene and object tag importance.
This type of interaction can be considered as a special context cue for both scene
and object tags.
Based on the proposed cues, we extend our MRF-based structured prediction
model to account for the scene tag importance in addition to the object tag importance. The final
structured importance prediction model predicts the object and scene tag
importance jointly.
We conduct a subjective test to demonstrate the validity of our relevance function,
which is based on our ground truth tag importance.
We demonstrate that both scene and object tags are necessary to accommodate
the variety of image content. Moreover, tag importance is an indispensable
factor in achieving a robust Multimodal Image Retrieval system.
In the second part of this research, we propose a high-performance Confusing Cat-
egories Identification and Resolution system to improve the classification accuracy of
Convolutional Neural Network based systems. It aims at correcting the mistakes the
baseline CNN makes when classifying the confusing samples within the dataset. This part
is covered in Chapter 5 with the following contributions:
We conduct a theoretical analysis and identify the innate reason why a CNN per-
forms poorly on certain classes, i.e., why it confuses certain categories more easily than
others. The idea of anchor vectors is adopted for the analysis.
Based on the theoretical analysis, we propose a data-independent but network-
specific approach to automatically group the categories into confusion sets, which
is more robust than methods based on the confusion matrix of held-out data.
We propose a binary-tree-structured clustering procedure to divide the images
within a confusion set into visually similar subsets. This makes the classifier
of each subset learn the discriminative features that help distinguish the subtle
differences between confusing categories.
For mixed-subset classification, we propose to use the random forest classifier, which
is powerful, has a feature selection capability, and requires much less training
time.
We conduct extensive experiments to study the contributions of different modules,
which give a better understanding of the whole system.
1.4 Organization of the Dissertation
The rest of the dissertation is organized as follows. The background machine learning
techniques closely related to this work, namely the Convolutional Neural Network, Canon-
ical Correlation Analysis, and Structured Prediction, are introduced in Chapter 2. Then,
the MIR system with automatic object tag importance prediction is proposed in Chap-
ter 3. The extended MIR system that jointly considers the scene and object tag impor-
tance is introduced in Chapter 4. The Confusing Categories Identification and Resolution
object classification system is presented in Chapter 5. Chapters 3, 4, and 5 all include
extensive experimental results to demonstrate the superior performance of our proposed
systems. Finally, concluding remarks and future work are given in Chapter 6.
Chapter 2
Background Review
In this chapter, we will go over three machine learning algorithms closely related to this
work. They are Convolutional Neural Network (CNN), Canonical Correlation Analysis
(CCA), and Structured Prediction.
2.1 Deep Learning and Convolutional Neural Network
Deep learning, a.k.a. the deep neural network, is an evolved version of the neural network
with many hidden layers, optimized by backpropagation. The first work that aimed
at applying a neural network to a visual recognition problem was proposed in [60],
where the Convolutional Neural Network was introduced to solve a document recognition
task. Despite its good performance, training such a network was very tricky and
slow due to problems such as gradient diffusion. It was in 2006 that deep architectures
started to gain attention, thanks to greedy layer-wise training [6] and the deep belief network
[40]. After that, more work was proposed in the deep learning research field, much
of it particularly related to the object classification task [23, 25, 50, 110].
The work in [53] was a milestone that drew the attention of computer vision researchers
to the deep learning method. Interested readers can refer to [5, 34] for a detailed overview
of these methods. Even though there are many different deep learning architectures,
it is the deep Convolutional Neural Network (CNN) that performs best in visual
recognition problems.
The key reasons for the success of deep CNNs can be attributed to the following three
points:
1. Non-linearity and the cascade of many layers. Due to the non-linearity of the activa-
tion function, a neural network behaves like a non-linear classifier. With many layers
of non-linearity cascaded together, the model is extremely powerful in drawing com-
plex decision boundaries in the high-dimensional feature space. It was pointed out in [5]
that, without the cascading structure, a shallow network would need an exponential
number of neurons to achieve the same model complexity as a deep network.
In other words, a deep structure is an efficient tradeoff between the “width” and the “depth” of
the network.
2. Massive amounts of training data. With such high model complexity, the learned
model will easily suffer from overfitting if there is insufficient training data.
Fortunately, with the advancement of the information era, image data, even annotated
image data, is no longer a scarce resource. With Amazon Mechanical Turk, the
data can be obtained at a relatively cheap price. This has resulted in today’s million-
image-scale datasets such as [20, 115]. With such a large amount of training data in
hand, the problem of overfitting can be alleviated to a certain extent.
3. Efficient GPU implementation. This reason is particularly relevant for computer
vision, which is a computationally intensive research field. Moreover, the
convolution of image data with a filter is a process that can be easily parallelized.
This makes the GPU implementation of a CNN particularly useful, and it greatly short-
ens the training time.
Below, we briefly introduce the CNN and some existing CNN struc-
tures.
To apply a neural network to images, a straightforward way would be to treat each
m × n image as an mn-dimensional vector and feed this vector into the neural
network. Consider a simple neural network with only one hidden layer whose size is k;
then we have to train a weight matrix of dimension mn × k. Suppose the image size is
96 × 96 and we want to train 400 features; then the weight matrix will contain 3,686,400
entries. Also, the computational cost of calculating the activation of each neuron in the
hidden layer is very high. Moreover, this naive approach ignores the stationarity property
of natural images, which means that natural images usually exhibit the same statistics
across different parts of the image. To address this problem, the concept of the convolutional
neural network (CNN) was proposed in [60], and it is widely used today when deep learning
methods are applied to large images.
Specifically, the basic building blocks of a CNN include the convolutional layer and the pooling layer. Figure 2.1 shows an example of a basic convolutional neural network. This basic CNN consists of one convolutional layer, one pooling layer and a multinomial logistic regression layer (for classification purposes).
For the convolutional layer, the key ideas are weight sharing and local connectivity. This means that small patches within the input data share the same weight matrix when they are fed into the convolutional layer. To obtain the output of a hidden unit, the weight matrix is shifted over the entire image, and thus a map smaller than the original input image is generated. For example, if the input image is $m \times m$ and the weight matrix covers an $l \times l$ region, the output map for one hidden unit will be $(m - l + 1) \times (m - l + 1)$. We call this map the feature map. This process closely resembles the convolution operation in signal processing, and thus this layer is called the convolutional layer. For a convolutional layer with $K$ hidden units, $K$ feature maps are generated.
Figure 2.1: An example of a convolutional neural network, consisting of a convolution layer, a pooling layer, a stacked feature vector and a multinomial logistic regression layer, where the convolution filters and the multinomial logistic regression weights are trained jointly.
The convolutional layer is usually followed by a pooling layer. In the pooling layer, the feature map is divided into smaller non-overlapping blocks, and a value summarizing the feature values within each block is generated and fed to the next layer. This summarization process is called pooling, and it helps further improve the invariance of the extracted features. Pooling can be achieved by taking either the max or the mean of the values within the block. It can also be viewed as a downsampling of the feature map, which produces a lower resolution version of it.
In the above framework, the parameters to be learned are the convolutional kernels in the convolutional layer and the multinomial logistic regression model. This is essentially training a neural network, and thus it can be done by back propagation. To apply back propagation, the standard procedure is to first randomly initialize the weights of the convolutional kernel and the multinomial logistic regression model. Then a feed-forward pass is performed through the whole network, whose output is the probability of each training image belonging to each class. Let $p(y^{(i)} = j \mid x^{(i)}; \theta)$ be the probability that the $i$th training sample belongs to the $j$th class under weights $\theta$. It is natural to use the cross entropy as the cost function of this network:
$$ J(W, b, \theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} \mathbb{1}\{y^{(i)} = j\} \log p(y^{(i)} = j \mid x^{(i)}; W, b, \theta). $$
Then it can be verified that the error $\delta_m^{(i)}$ at the output layer for the $i$th training example can be calculated as
$$ \delta_m^{(i)} = p^{(i)} - g^{(i)}, $$
where $p^{(i)} = \big[\, p(y^{(i)} = 1 \mid x^{(i)}; W, b, \theta), \ldots, p(y^{(i)} = J \mid x^{(i)}; W, b, \theta) \,\big]^T$ is the output of the network and $g^{(i)}$ is the ground truth vector, i.e., if the $i$th image belongs to the $k$th class, the $k$th element of $g^{(i)}$ is 1 and all others are 0. After obtaining the error at the multinomial logistic regression layer, the gradient with respect to $\theta$ is
$$ \frac{\partial J}{\partial \theta} = \frac{1}{N} \sum_{i=1}^{N} \delta_m^{(i)} \big(a_p^{(i)}\big)^T, $$
where $a_p^{(i)}$ is the output of the pooling layer for the $i$th training image. The error at the pooling layer can be derived as $\delta_p^{(i)} = \theta^T \delta_m^{(i)}$, where $\theta$ is the multinomial logistic regression weight in the current iteration. We then re-order $\delta_p^{(i)}$ into the map form corresponding to the downsampled feature maps of the feed-forward pass. Let the re-ordered 2D error map for the $j$th pooling unit be $\delta_{pj}^{(i)}$. The corresponding error of the convolutional layer, $\delta_{cj}^{(i)}$, can then be calculated as
$$ \delta_{cj}^{(i)} = \mathrm{upsample}\big(\delta_{pj}^{(i)}\big) \odot a_{cj}^{(i)} \odot \big(1 - a_{cj}^{(i)}\big), $$
where $\odot$ denotes element-wise multiplication and $a_{cj}^{(i)}$ is the output of the $j$th unit of the convolutional layer (a sigmoid activation is assumed here, which gives the $a(1-a)$ derivative term).
Finally, the gradients of $W$ and $b$ for the $j$th convolutional kernel can be calculated as
$$ \frac{\partial J}{\partial b_j} = \frac{1}{N} \sum_{i=1}^{N} \sum_{u,v} \big(\delta_{cj}^{(i)}\big)_{uv} \quad \text{and} \quad \frac{\partial J}{\partial W_j} = \frac{1}{N} \sum_{i=1}^{N} \delta_{cj}^{(i)} \ast x^{(i)}, $$
where $\ast$ is the convolution operation and the subscript $uv$ denotes the value in the $u$th row and the $v$th column of the convolutional error map.
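To make the above update rules concrete, the following numpy sketch computes the gradients for a single training image under simplifying assumptions (one convolutional filter, a sigmoid activation, 2x2 mean pooling, softmax regression and made-up sizes); it mirrors the equations above rather than any particular implementation used in this work.

# Feed-forward and back-propagation for a toy one-filter CNN (single image).
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
m, l, J = 8, 3, 4                        # image size, filter size, number of classes
x = rng.standard_normal((m, m))          # input image
g = np.zeros(J); g[1] = 1.0              # one-hot ground truth vector
W = 0.1 * rng.standard_normal((l, l))    # convolutional kernel
b = 0.0                                  # convolutional bias
fm = (m - l + 1) // 2                    # pooled feature-map side (even size assumed)
theta = 0.1 * rng.standard_normal((J, fm * fm))   # multinomial logistic regression weights

# Feed-forward pass.
z = correlate2d(x, W, mode="valid") + b              # (m-l+1) x (m-l+1) feature map
a_c = 1.0 / (1.0 + np.exp(-z))                       # sigmoid activation
a_p = a_c.reshape(fm, 2, fm, 2).mean(axis=(1, 3))    # 2x2 mean pooling
scores = theta @ a_p.ravel()
p = np.exp(scores - scores.max()); p /= p.sum()      # softmax output

# Back-propagation, following the equations in the text.
delta_m = p - g                                      # output-layer error
grad_theta = np.outer(delta_m, a_p.ravel())          # dJ/dtheta
delta_p = (theta.T @ delta_m).reshape(fm, fm)        # pooling-layer error
delta_up = np.kron(delta_p, np.ones((2, 2))) / 4.0   # upsample (mean pooling spreads error)
delta_c = delta_up * a_c * (1.0 - a_c)               # convolutional-layer error
grad_b = delta_c.sum()                               # dJ/db
grad_W = correlate2d(x, delta_c, mode="valid")       # dJ/dW
print(grad_W.shape, grad_theta.shape)                # (3, 3) (4, 9)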
It is worth mentioning that training this network is slow because convolutions are required in both the feed-forward and back-propagation passes: every single parameter update needs convolutions in both steps. Thus, GPU batch processing is now a standard requirement for CNN training.
Two popular networks that are widely benchmarked against in the literature are AlexNet [53] and VGG16 [89], which are shown in Figure 2.2 and Figure 2.3, respectively. They consist of 8 and 16 weight layers, respectively. Moreover, VGG16 uses smaller convolutional filters in the first few layers, which proves to be more effective than AlexNet's larger filters. Their top-1 error rates on the ImageNet ILSVRC2012 dataset are 36.7% and 27.3%, respectively.
More recently, deeper networks with better performance have been proposed. The most distinguished one is ResNet [39]. It has three versions with 50, 101 and 152 weight layers, whose top-1 error rates on the ImageNet ILSVRC2012 dataset are 24.7%, 23.6% and 23.0%, respectively.
Figure 2.2: AlexNet structure [53].
Figure 2.3: VGG16 structure. Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
2.2 Canonical Correlation Analysis
As mentioned in Chapter 1, in multimodal image retrieval, an image is associated with both a visual feature vector $f_v$ (e.g., the histogram of low-level visual words) and a textual feature vector $f_t$ (e.g., the tag vector). Given these feature pairs $(f_v^{(i)}, f_t^{(i)})$ for $N$ images, Canonical Correlation Analysis (CCA) aims at finding a common subspace that maximizes the canonical correlation between their projections in this subspace.
More formally, let $\langle \cdot, \cdot \rangle$ denote the inner product, and let $s_v$ and $s_t$ be the unit direction vectors onto which the visual feature vector $f_v$ and the textual feature vector $f_t$ are projected. After projecting onto their corresponding directions, the projected feature pairs for the $N$ images become $\big( \langle s_v, f_v^{(i)} \rangle, \langle s_t, f_t^{(i)} \rangle \big)$. In this setting, $f_v$ and $f_t$ can be considered as two random vectors, and $(f_v^{(i)}, f_t^{(i)})$ can be seen as their samples.
The canonical correlation between these two random vectors can be written as (assuming both random vectors have zero mean):
$$ \frac{ \hat{E}\big[ \langle s_v, f_v \rangle \langle s_t, f_t \rangle \big] }{ \sqrt{ \hat{E}\big[ \langle s_v, f_v \rangle^2 \big] \, \hat{E}\big[ \langle s_t, f_t \rangle^2 \big] } }, \tag{2.1} $$
where the numerator in Eq. (2.1) is the estimated covariance between the random vectors $f_v$ and $f_t$ computed from their samples, and the denominator contains their estimated standard deviations.
Thus, CCA aims at finding $s_v$ and $s_t$ that maximize the canonical correlation in Eq. (2.1), resulting in the optimization problem:
$$ (s_v, s_t) = \arg\max_{s_v, s_t} \frac{ \hat{E}\big[ \langle s_v, f_v \rangle \langle s_t, f_t \rangle \big] }{ \sqrt{ \hat{E}\big[ \langle s_v, f_v \rangle^2 \big] \, \hat{E}\big[ \langle s_t, f_t \rangle^2 \big] } } = \arg\max_{s_v, s_t} \frac{ \hat{E}\big[ s_v^T f_v f_t^T s_t \big] }{ \sqrt{ \hat{E}\big[ s_v^T f_v f_v^T s_v \big] \, \hat{E}\big[ s_t^T f_t f_t^T s_t \big] } } = \arg\max_{s_v, s_t} \frac{ s_v^T \hat{E}\big[ f_v f_t^T \big] s_t }{ \sqrt{ s_v^T \hat{E}\big[ f_v f_v^T \big] s_v \; s_t^T \hat{E}\big[ f_t f_t^T \big] s_t } }. \tag{2.2} $$
Here the superscript $T$ indicates the transpose. We write $C_{vt} = \hat{E}\big[ f_v f_t^T \big]$ for the estimated cross-covariance matrix between the visual and textual feature vectors, $C_v = \hat{E}\big[ f_v f_v^T \big]$ for the estimated covariance matrix of the visual feature vector, and $C_t = \hat{E}\big[ f_t f_t^T \big]$ for that of the textual feature vector. Thus, after simplification, the optimization problem becomes:
$$ (s_v, s_t) = \arg\max_{s_v, s_t} \frac{ s_v^T C_{vt} s_t }{ \sqrt{ s_v^T C_v s_v \; s_t^T C_t s_t } }. \tag{2.3} $$
Observe that multiplying $s_v$ or $s_t$ by an arbitrary scaling factor does not change the canonical correlation. Thus, the unconstrained optimization can be converted into a constrained optimization problem:
$$ (s_v, s_t) = \arg\max_{s_v, s_t} \; s_v^T C_{vt} s_t \quad \text{s.t.} \quad s_v^T C_v s_v = 1, \;\; s_t^T C_t s_t = 1. \tag{2.4} $$
By applying the standard Lagrange multiplier method, we obtain:
$$ L(\lambda_v, \lambda_t, s_v, s_t) = s_v^T C_{vt} s_t - \frac{\lambda_v}{2}\big( s_v^T C_v s_v - 1 \big) - \frac{\lambda_t}{2}\big( s_t^T C_t s_t - 1 \big). \tag{2.5} $$
Taking the gradients of $L$ with respect to $s_v$ and $s_t$ and setting them to 0, we can show that $\lambda_v = \lambda_t = \lambda$. Using this condition and conducting further simplification (assuming the covariance matrix $C_t$ is invertible), we obtain:
$$ C_{vt} C_t^{-1} C_{tv} s_v = \lambda^2 C_v s_v. \tag{2.6} $$
This reduces the problem to a generalized eigenproblem. A further simplification is possible if $C_v$ is also invertible (see [37] for more details).
One drawback of CCA is its linearity: it is quite possible that the dependencies between the visual and textual domains are non-linear due to the complicated nature of multimedia data. To address this problem, a pair of nonlinear transforms $\phi_v$ and $\phi_t$ can be used to map the visual and textual features into high dimensional spaces, respectively. With the kernel functions
$$ K_v\big( f_v^{(i)}, f_v^{(j)} \big) = \phi_v\big( f_v^{(i)} \big)^T \phi_v\big( f_v^{(j)} \big), \tag{2.7} $$
$$ K_t\big( f_t^{(i)}, f_t^{(j)} \big) = \phi_t\big( f_t^{(i)} \big)^T \phi_t\big( f_t^{(j)} \big), \tag{2.8} $$
$\phi_v$ and $\phi_t$ are only computed implicitly. KCCA attempts to find the maximally correlated subspace within the two transformed spaces. It is proved in [37] that the new optimization problem is
$$ (\alpha, \beta) = \arg\max_{\alpha, \beta} \frac{ \alpha^T K_v K_t \beta }{ \sqrt{ (1-\kappa)\, \alpha^T K_v^2 \alpha + \kappa\, \alpha^T K_v \alpha } \, \sqrt{ (1-\kappa)\, \beta^T K_t^2 \beta + \kappa\, \beta^T K_t \beta } }, \tag{2.9} $$
where $\kappa$ is a regularization parameter.
The relation between the solutions $(\alpha, \beta)$ and $(s_v, s_t)$ is
$$ s_v = \sum_{i=1}^{N} \alpha_i \, \phi_v\big( f_v^{(i)} \big), \qquad s_t = \sum_{i=1}^{N} \beta_i \, \phi_t\big( f_t^{(i)} \big). \tag{2.10} $$
This optimization problem can be formulated as a generalized eigenvalue problem and solved efficiently. The eigenvectors $(\alpha^{(1)}, \beta^{(1)}), \ldots, (\alpha^{(T)}, \beta^{(T)})$ corresponding to the $T$ largest eigenvalues yield the bases
$$ \big( s_v^{(1)}, s_t^{(1)} \big), \ldots, \big( s_v^{(T)}, s_t^{(T)} \big) \tag{2.11} $$
of the subspace, where $T$ is the desired subspace dimension.
Finally, to process a new image query, we project its visual feature vector $f_v^{(q)}$ onto each basis vector by
$$ s_v^T \phi_v\big( f_v^{(q)} \big) = \Big( \sum_{i=1}^{N} \alpha_i \, \phi_v\big( f_v^{(i)} \big) \Big)^T \phi_v\big( f_v^{(q)} \big) = \sum_{i=1}^{N} \alpha_i \, K_v\big( f_v^{(i)}, f_v^{(q)} \big). \tag{2.12} $$
The projections onto $s_v^{(1)}, \ldots, s_v^{(T)}$ together form the projected vector in the subspace. Images in the database are then ranked according to their normalized correlation with this projected vector and returned as the retrieval results.
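A small sketch of this query projection and ranking step is given below; the RBF kernel, the random data, and the assumption that the KCCA coefficients (one column of alpha per basis) and the projected database vectors have already been computed are all illustrative.

# Project a query with Eq. (2.12) and rank the database by normalized correlation.
import numpy as np

rng = np.random.default_rng(0)
N, Dv, T = 200, 32, 8
train_v = rng.standard_normal((N, Dv))      # training visual features
alpha   = rng.standard_normal((N, T))       # assumed learned KCCA coefficients
db_proj = rng.standard_normal((300, T))     # assumed projected database items

def k_v(a, b, gamma=0.05):
    # RBF kernel between a set of feature vectors and one feature vector.
    return np.exp(-gamma * ((a - b) ** 2).sum(axis=-1))

query = rng.standard_normal(Dv)
kq = k_v(train_v, query)                    # K_v(f_v^(i), f_v^(q)) for i = 1..N
q_proj = alpha.T @ kq                       # projections onto s_v^(1..T)

scores = db_proj @ q_proj / (np.linalg.norm(db_proj, axis=1) * np.linalg.norm(q_proj))
ranking = np.argsort(-scores)               # database images ranked by similarity
print(ranking[:5])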
2.3 Structured Prediction
Structured prediction [96] is a widely used technique in the computer vision field. Unlike the traditional prediction (classification) problem, structured prediction models not only the relationship between the observable variables (features) and the output variables (class labels), but also the interactions among the observable variables and among the output variables, respectively.
The classical example of structured prediction is image segmentation, which was the first computer vision problem that structured prediction was applied to. Consider a simple image segmentation task where each pixel in the image needs to be classified as either foreground or background. In this case, each image pixel has observable variables (features) such as its gray level and color values, and the task is to assign an output variable (a class label of either foreground or background) to every pixel. Obviously, the features of each pixel provide visual cues for predicting its class label. For example, a sky- or water-like color indicates that the pixel is more likely to be background. Furthermore, there is a strong correlation between the class labels of neighboring pixels; more specifically, it is very likely that neighboring pixels take the same class label unless there is a very sharp color or gray level change. Thus, structured prediction essentially trades off individual pixel features against neighboring context information when making the prediction.
More formally, an auxiliary evaluation function is defined as $g: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, which essentially measures the compatibility of a pair $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. For simplicity, here we use $x$ and $y$ to represent groups of variables; for example, $x$ denotes the gray levels of all pixels in one image, and $y$ denotes their class labels. The prediction function is defined as $f: \mathcal{X} \rightarrow \mathcal{Y}$, which maps the input domain $\mathcal{X}$ (observable variables/features) to a structured output domain $\mathcal{Y}$. Given an $x \in \mathcal{X}$, $f$ maximizes $g$ over all possible $y \in \mathcal{Y}$. This can be written as:
$$ y^* = f(x) := \arg\max_{y \in \mathcal{Y}} g(x, y). \tag{2.13} $$
The auxiliary evaluation function $g(x, y)$ can be a probability distribution function such as $p(x, y)$ or $p(y \mid x)$; in this case, the problem reduces to a maximum likelihood (ML) or maximum a posteriori (MAP) problem. The probability distribution can be modeled using probabilistic graphical models, including the Markov Random Field (MRF) and the Conditional Random Field (CRF), which will be introduced later in this section. In a more general case, the auxiliary function can be written as a linear model $g(x, y) = \langle w, \varphi(x, y) \rangle$ with parameter vector $w$ and feature map $\varphi(x, y)$. It can be shown that, in graphical model notation, the linear model and the probabilistic representation are actually equivalent, just like the relationship between the least squares and the probabilistic interpretations of linear regression.
In either case, two processes are involved in developing a structured prediction
model: 1) parameter learning, and 2) inference.
Parameter learning is the process of fitting the evaluation function $g$ given the training data $\{(x^n, y^n)\}_{n=1,\ldots,N}$. This can be achieved by either probabilistic parameter learning or loss-minimizing parameter learning. Probabilistic parameter learning is usually applied when the evaluation function is based on a probabilistic graphical model. It essentially fits the joint or conditional probability distribution by maximizing the log likelihood of the training data with a gradient descent approach. On the other hand, the loss-minimizing approach can be applied to either type of $g$ (probability based or linear model based). Its goal is to minimize the empirical loss $\sum_{n=1}^{N} \Delta\big( y^n, \arg\max_{y \in \mathcal{Y}} g(x^n, y) \big)$, where $\Delta$ is a loss function designed specifically for the particular problem. This is the Structured Support Vector Machine (SSVM) approach introduced later in this section.
Inference has two types: probabilistic inference and Maximum a Posteriori (MAP) inference. This work only uses MAP inference, which is literally solving the problem in Eq. (2.13). MAP inference is not only needed for making the final prediction on testing data, but is also used inside loss-minimizing parameter learning.
In the following, I first introduce the basic probabilistic graphical models used in this work, and then the specific parameter learning tool, the Structured Support Vector Machine.
Markov Random Field and Conditional Random Field. A probabilistic graphical model represents the joint distribution of multiple random variables using a graph structure. In particular, the graph encodes the conditional independencies, making the probability representation clear and straightforward. There are two types of probabilistic graphical models: 1) the directed graphical model (Bayesian network), and 2) the undirected graphical model (Markov Random Field) [64, 73]. The Bayesian network is useful when there is a direct causal relationship between random variables, but this does not apply to many computer vision problems such as image segmentation. On the other hand, the Markov Random Field (MRF) models the interaction between multiple random variables without forcing a causal order on them, and is thus suitable for our work.
An MRF defines a joint distribution that satisfies the conditional independencies encoded by an undirected graph $G = (V, E)$, such that the joint distribution factorizes as
$$ p(y) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \psi_C(y_C), \tag{2.14} $$
where $\mathcal{C}(G)$ is the set of cliques in graph $G$, and $y_C$ is the set of variables corresponding to the nodes in clique $C$. $Z$ is the normalizing factor
$$ Z = \sum_{y \in \mathcal{Y}} \prod_{C \in \mathcal{C}(G)} \psi_C(y_C), \tag{2.15} $$
which ensures that the definition in Eq. (2.14) is a probability distribution; it is also known as the partition function. $\psi_C : \mathcal{Y}_C \rightarrow \mathbb{R}$ is known as the potential function or factor for the clique $C$, and it defines how the random variables in $C$ interact with each other. The potential function can either be defined manually using prior knowledge or learned from the data. The probability distribution in Eq. (2.14) can be written in a log-linear form using a change of variables:
$$ E_C(y_C) = -\log\big(\psi_C(y_C)\big), \qquad \psi_C(y_C) = \exp\big(-E_C(y_C)\big). $$
Here we call $E_C$ the energy function for the clique $C$. Substituting into Eq. (2.14), we have
$$ p(y) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \psi_C(y_C) = \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \exp\big(-E_C(y_C)\big) = \frac{1}{Z} \exp\Big( -\sum_{C \in \mathcal{C}(G)} E_C(y_C) \Big). \tag{2.16} $$
Using the log-linear form, we find that maximizing the probability becomes an energy minimization problem:
$$ \arg\max_{y \in \mathcal{Y}} p(y) = \arg\max_{y \in \mathcal{Y}} \frac{1}{Z} \exp\Big( -\sum_{C \in \mathcal{C}(G)} E_C(y_C) \Big) = \arg\max_{y \in \mathcal{Y}} \exp\Big( -\sum_{C \in \mathcal{C}(G)} E_C(y_C) \Big) = \arg\max_{y \in \mathcal{Y}} \Big( -\sum_{C \in \mathcal{C}(G)} E_C(y_C) \Big) = \arg\min_{y \in \mathcal{Y}} \sum_{C \in \mathcal{C}(G)} E_C(y_C). \tag{2.17} $$
This is the reason why the term energy minimization is so popular in the computer vision
research field. I will also use this term in my work.
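To make the energy minimization concrete, the following sketch performs exact MAP inference by exhaustive enumeration for a toy chain-structured MRF with unary and pairwise energies; the graph, the label set and the energy values are invented for illustration, and enumeration is only feasible for very small models.

# Exact MAP inference by brute force: y* = argmin_y sum_C E_C(y_C)  (Eq. 2.17).
import itertools
import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_labels = 4, 3
unary = rng.standard_normal((num_nodes, num_labels))        # E_i(y_i)
pairwise = rng.standard_normal((num_labels, num_labels))    # E_ij(y_i, y_j), shared
edges = [(0, 1), (1, 2), (2, 3)]                            # chain graph

def energy(y):
    e = sum(unary[i, y[i]] for i in range(num_nodes))
    e += sum(pairwise[y[i], y[j]] for i, j in edges)
    return e

best = min(itertools.product(range(num_labels), repeat=num_nodes), key=energy)
print("MAP labeling:", best, "energy:", energy(best))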
The Conditional Random Field (CRF) [57, 94] extends the Markov Random Field in that it directly models the conditional distribution $p(y \mid x)$, where $x$ denotes the observable variables such as features. More intuitively, the potential function of each clique in the graph defined in Eq. (2.14) is conditioned globally on the input features. The true distinction between the MRF and the CRF is rather ambiguous, and researchers from the machine learning and computer vision fields do not have a consistent view of it. In this work, we follow the energy minimization approach using the SSVM introduced next. Thus, the distinction between the MRF and the CRF is not important here.
Structured Support Vector Machine. The Structured Support Vector Machine (SSVM) [97, 98], like the regular Support Vector Machine, is a max-margin classifier, but one applied to structured output variables. Assuming the linear evaluation function $g(x, y) = w^T \varphi(x, y)$, the prediction function $f$ is parametrized by $w$ and can be written as:
$$ f(x; w) = \arg\max_{y \in \mathcal{Y}} w^T \varphi(x, y). \tag{2.18} $$
Given a training dataset $\{(x^n, y^n)\}$, we obtain zero loss with this predictor if
$$ \forall n: \;\; \max_{y \in \mathcal{Y} \setminus \{y^n\}} w^T \varphi(x^n, y) \le w^T \varphi(x^n, y^n), \tag{2.19} $$
where $\mathcal{Y}$ is the set of all possible outputs for the training data. Define $\delta\varphi_n(y)$ as:
$$ \delta\varphi_n(y) = \varphi(x^n, y^n) - \varphi(x^n, y). \tag{2.20} $$
Then the zero-loss condition becomes:
$$ \forall n, \; \forall y \in \mathcal{Y}: \;\; w^T \delta\varphi_n(y) \ge 0. \tag{2.21} $$
We pick the solution $w$ that maximizes the margin $\gamma$, defined as:
$$ \gamma = \min_n \Big( w^T \varphi(x^n, y^n) - \max_{y \in \mathcal{Y} \setminus \{y^n\}} w^T \varphi(x^n, y) \Big). \tag{2.22} $$
Since the margin can be made arbitrarily large by rescaling $w$, we fix its norm to 1, resulting in the optimization problem:
$$ \max_{w: \|w\|=1} \; \gamma \quad \text{s.t.} \quad \forall n, \; \forall y \in \mathcal{Y}\setminus\{y^n\}: \;\; w^T \delta\varphi_n(y) \ge \gamma. \tag{2.23} $$
Using a similar trick as in the SVM, we can reformulate the optimization problem as:
$$ \min_{w} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad \forall n, \; \forall y \in \mathcal{Y}\setminus\{y^n\}: \;\; w^T \delta\varphi_n(y) \ge 1. \tag{2.24} $$
To allow for the case where zero loss cannot be achieved, we relax the constraints by introducing a slack term $\xi_n$ for each training sample. This yields
$$ \min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad \forall n, \; \forall y \in \mathcal{Y}\setminus\{y^n\}: \;\; w^T \delta\varphi_n(y) \ge 1 - \xi_n, \;\; \xi_n \ge 0. \tag{2.25} $$
In structured prediction, it is better to treat different constraint violations differently. This can be achieved by scaling either the slack variable or the margin using a loss function. To obtain the margin-rescaling formulation, we define the margin to be proportional to the loss. This yields:
$$ \min_{w, \xi} \; \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{n=1}^{N} \xi_n \quad \text{s.t.} \quad \forall n, \; \forall y \in \mathcal{Y}\setminus\{y^n\}: \;\; w^T \delta\varphi_n(y) \ge \Delta(y^n, y) - \xi_n, \;\; \xi_n \ge 0. \tag{2.26} $$
The formulation in Eq. (2.26) is called the N-slack-variable formulation. There are various sophisticated optimization tools for solving the above problem, among which the cutting plane algorithm [47] is the most widely used. Although the cutting plane algorithm for the N-slack formulation achieves polynomial time complexity, a one-slack-variable formulation can speed it up to linear time. The one-slack, margin-rescaling formulation simply uses a single slack variable $\xi$ instead of $N$ of them, but $|\mathcal{Y}|^N$ constraints instead of just $N|\mathcal{Y}|$. This yields the formulation
$$ \min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C\xi \quad \text{s.t.} \quad \forall (\bar{y}^1, \ldots, \bar{y}^N) \in \mathcal{Y}^N: \;\; \frac{1}{N}\sum_{n=1}^{N} w^T \delta\varphi_n(\bar{y}^n) \ge \frac{1}{N}\sum_{n=1}^{N} \Delta(\bar{y}^n, y^n) - \xi. \tag{2.27} $$
One can show that the solution $w$ is the same for the one-slack and N-slack formulations, with $\xi^* = \frac{1}{N}\sum_{n=1}^{N} \xi_n^*$.
The pseudocode of the cutting plane algorithm under the margin-rescaling, 1-slack-variable formulation is shown below.
Algorithm 1 Cutting plane algorithm for SSVM (margin rescaling, 1-slack variable formulation)
1: Input: $D = \{(x^1, y^1), \ldots, (x^N, y^N)\}$, $C$, $\epsilon$
2: $\mathcal{W} \leftarrow \emptyset$
3: repeat
4: $\quad (w, \xi) \leftarrow \arg\min_{w, \xi \ge 0} \frac{1}{2}\|w\|^2 + C\xi$
5: $\quad\quad$ s.t. $\forall (\bar{y}^1, \ldots, \bar{y}^N) \in \mathcal{W}: \; \frac{1}{N}\sum_{n=1}^{N} w^T \delta\varphi_n(\bar{y}^n) \ge \frac{1}{N}\sum_{n=1}^{N} \Delta(\bar{y}^n, y^n) - \xi$
6: $\quad$ for $n = 1 : N$ do
7: $\quad\quad \hat{y}^n \leftarrow \arg\max_{\tilde{y} \in \mathcal{Y}} \; \Delta(\tilde{y}, y^n) + w^T \varphi(x^n, \tilde{y})$
8: $\quad$ end for
9: $\quad \mathcal{W} \leftarrow \mathcal{W} \cup \{(\hat{y}^1, \ldots, \hat{y}^N)\}$
10: until $\frac{1}{N}\sum_{n=1}^{N} \Delta(\hat{y}^n, y^n) - \frac{1}{N} w^T \sum_{n=1}^{N} \delta\varphi_n(\hat{y}^n) \le \xi + \epsilon$
11: return $(w, \xi)$
It is worth mentioning that the step in line 7 of Algorithm 1 needs to find the structured output that maximizes the sum of the loss and the evaluation function. This step is called "finding the most violated constraint", and it is essentially a MAP inference with a modified evaluation function. This implies that MAP inference is also needed as one of the steps of the parameter learning stage. Thus, inference is a very important step, and its efficiency affects not only the final prediction stage but also the speed of learning. It is for this reason that structured output inference is still an active research problem.
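As an illustration of line 7, the sketch below finds the most violated constraint by exhaustive enumeration over a tiny label space; the joint feature map, loss function and weights are invented for illustration, and exhaustive search is only practical when the output space is small (as it is for the tag importance model of Chapter 3).

# Loss-augmented inference: y_hat = argmax_y Delta(y, y_n) + w^T phi(x_n, y).
import itertools
import numpy as np

rng = np.random.default_rng(0)
num_vars, num_labels, dim = 3, 4, 6
w = rng.standard_normal(dim)
x_n = rng.standard_normal((num_vars, dim))
y_n = (2, 0, 3)                                    # ground-truth structured output

def phi(x, y):
    # Toy joint feature map: per-variable features signed by the label parity.
    return sum(x[i] * (1.0 if y[i] % 2 == 0 else -1.0) for i in range(len(y)))

def delta(y, y_true):
    # Hamming-style loss between two structured outputs.
    return sum(yi != ti for yi, ti in zip(y, y_true)) / len(y_true)

candidates = itertools.product(range(num_labels), repeat=num_vars)
y_hat = max(candidates, key=lambda y: delta(y, y_n) + w @ phi(x_n, y))
print("most violated constraint:", y_hat)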
Chapter 3
Multimodal Image Retrieval with
Object Tag Importance Prediction
Visual cues and texts associated with images (e.g. tags, sentence description) are used
for image search and retrieval in today’s automatic Multimodal Image Retrieval (MIR)
systems. However, all tags are treated as equally important in these systems, and tags
do not boost the retrieval performance substantially. In this chapter, we first present a
method that measures the relative importance of object tags for images with sentence
descriptions. Next, we propose an object tag importance prediction model by exploiting
visual, semantic and context cues. Then, a Canonical Correlation Analysis (CCA) is
conducted to learn the relation between the image visual feature vector and object tag
importance to enable robust retrieval performance. Experimental results show signif-
icant improvements of the proposed MIR with Tag Importance Prediction (MIR/TIP)
system over traditional MIR systems that ignore tag importance.
3.1 Motivation
The multimodal approach [12,32,42,43,80,111] has gained popularity in image retrieval
research in recent years due to the availability of user provided tags, sentence descrip-
tions or even paragraphs [12]. As the most widely used textual information in image
retrieval, tags possess critical semantic information to reduce the semantic gap. How-
ever, tags provided by humans are usually noisy, and they might not have any relevance
to actual image content [67]. Furthermore, even if a tag is related to image content,
the content it represents might not be perceived as important by humans. Intuitively
speaking, capturing and incorporating human perceived tag importance should improve
the performance of automatic MIR systems, which will be clearly demonstrated in this
work.
When people are asked to describe an image, they tend to be highly selective and
focus on important content in the image. Thus, sentence descriptions serve as a natural
indicator of tag importance. Berg et al. [7] attempted to model the tag importance as a
binary value, i.e. a tag is important if and only if it appears in a sentence. However, a
binary-valued tag importance cannot capture the relative importance of tags well, which
degrades the retrieval performance of MIR systems. To address this deficiency, we
study the relative tag importance prediction problem and incorporate the predicted tag
importance into an MIR system, leading to the MIR with Tag Importance Prediction
(MIR/TIP) system.
Our proposed system has three major contributions. First, a technique based on the
discounted probability is developed to measure tag importance. This measured impor-
tance will serve as the ground truth for evaluating the predicted tag importance. Second,
a novel structured model is proposed to predict tag importance by integrating visual,
semantic and context cues. While the first two were explored before [7, 92], the con-
text cue has never been considered for tag importance prediction in previous work. We
will show significant improvement in tag importance prediction using the context cue,
which in turn results in improved image retrieval performance. Finally, the Canonical
Correlation Analysis (CCA) is adopted to incorporate predicted tag importance in the
proposed MIR/TIP system. We use two datasets to conduct experiments for tag impor-
tance prediction and MIR. They are the UIUC Pascal Sentence (UIUC) dataset [79] and the Microsoft Common Objects in Context (COCO) dataset [11, 65]. The superior performance of the proposed MIR/TIP system against traditional MIR systems is clearly demonstrated.
Figure 3.1: An overview of the proposed MIR/TIP system (importance measurement, importance prediction, and multimodal retrieval). Given a query image with an important "dog" and "bicycle", the MIR/TIP system ranks good retrieval examples with an important "dog" and "bicycle" ahead of bad retrieval ones in which the "dog" and "bicycle" are less important.
3.2 Overview of Proposed System
A high-level description of the proposed MIR/TIP system is given in Figure 3.1. It
consists of the following three stages (or modules):
1. Tag importance measurement;
2. Tag importance prediction; and
3. Multimodal retrieval assisted by predicted tag importance.
It is assumed in our experiments that there are three types of images on the web:
A. images with human provided sentences and tags;
B. images with human provided tags only; and
C. images without any textual information.
In the tag importance measurement stage, images in Type A are used to obtain the
measured tag importance, which will serve as the ground truth tag importance. In the
tag importance prediction stage, images in Type A are first used as the training images
to create a tag importance prediction model. Then, tag importance for images in Type B
will be predicted based on the learned model. Finally, images in Types A and B will be
used as training images to learn the CCA semantic subspace in the multimodal retrieval
stage. Images in Type C will serve as test images to validate the performance of our
MIR/TIP system.
3.3 Measuring Tag Importance
Measuring human-perceived importance of an object is an important yet challenging
task in MIR. Researchers attempted to measure the importance of tags associated with
images from two sources: 1) human provided ranked tag lists [42,43,92], and 2) human
sentence descriptions [7]. The major drawback of ranked tag lists is its availability. Tags
are rarely ranked according to their importance, but rather listed randomly. Obtaining
multiple ranked tag lists from human (using Amazon Turk) is labor intensive and thus
not a feasible solution. In contrast, human sentence descriptions are easier to get since
they can be collected from users’ comments in the social photo sharing website. For this
reason, we adopt the human sentence description as the source to measure tag impor-
tance. Clearly, the binary-valued tag importance as proposed in [7] cannot capture rela-
tive tag importance in an image. For example, both “person” and “motorbike” in Fig. 3.2
are important since they appear in sentences. However, as compared with “person” that
appears in all five sentences, “motorbike” only appears twice. This shows that humans
Figure 3.2: An example of object importance measured using human sentence descriptions (object tags: car, person, motorbike; measured importance: car 0, person 0.8, motorbike 0.2), where the object tags "person" and "motorbike" appear in the human sentences in terms of their synonyms "man" and "scooter".
perceive “person” as more important than “motorbike” in Fig. 3.2. Thus, tag importance
should be quantified in a finer scale rather than a binary value.
Desired tag importance should serve the following two purposes.
1. Within-image comparison. Tag importance should teach the retrieval system to
ignore unimportant content within the image.
2. Cross-image comparison. Given two images with the same object tag, tag impor-
tance should identify which image has a more important version of that object tag.
To achieve these objectives, one heuristic way is to define the importance of an object
tag in an image as the probability for it to be mentioned in a sentence, which is called
probability importance. To give an example, for the left image in Fig. 3.3 (i.e. the
sample database image in Fig. 3.1), the importance of “dog” and “bicycle” are 0.8 and 1
since they appear in four and five sentences, respectively. While this notion can handle
within-image comparison, it fails to model cross-image comparison. For instance, as
compared with the right image in Fig. 3.3, where the “bicycle” is the only tag appearing
in all five sentences, it is clear that the “bicycle” in the left image is less important.
However, its probability importance has the same value, 1, in both images.
Figure 3.3: An example comparing probability importance and discounted probability importance, where the "bicycle" in both images is equally important under probability importance but not under discounted probability importance (left "bicycle": 0.6; right "bicycle": 1.0).
To better handle cross-image comparison, we propose a measure called discounted
probability importance. It is based on the observation that different people describe an
image in different levels of detail. Obviously, tags mentioned by detail-oriented people
should be discounted accordingly. Mathematically, discounted probability importance
of an object tag $t$ in the $n$th image, $I_n(t)$, is defined as
$$ I_n(t) = \frac{1}{K_n} \sum_{k=1}^{K_n} \frac{\mathbb{1}\{ t \in T_n^{(k)} \}}{\big|T_n^{(k)}\big|}, \tag{3.1} $$
where $K_n$ is the total number of sentences for the $n$th image, $\mathbb{1}$ is the indicator function, and $T_n^{(k)}$ is the set of all tags mentioned in the $k$th sentence of the $n$th image. An example of measured tag importance using Eq. (3.1) is shown in Fig. 3.2. Also, for the bicycles in Fig. 3.3, the measured tag importance using the discounted probability is 0.6 (left) and 1 (right). In the following, we only use the discounted probability importance as the measured tag importance, which serves as the ground truth for our experiments on tag importance prediction.
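A small sketch of Eq. (3.1) for the example of Fig. 3.2 is given below; the synonym matching that maps sentence words such as "man" and "scooter" to the object tags is assumed to have been done already.

# Discounted probability importance (Eq. 3.1).
from collections import defaultdict

def discounted_importance(sentence_tag_sets, vocabulary):
    # sentence_tag_sets[k] is the set of object tags mentioned in the k-th sentence.
    K = len(sentence_tag_sets)
    importance = defaultdict(float)
    for tags in sentence_tag_sets:
        for t in tags:
            importance[t] += 1.0 / (len(tags) * K)
    return {t: importance[t] for t in vocabulary}

sentences = [{"person", "motorbike"}, {"person", "motorbike"},
             {"person"}, {"person"}, {"person"}]
print(discounted_importance(sentences, ["car", "person", "motorbike"]))
# approximately {'car': 0.0, 'person': 0.8, 'motorbike': 0.2}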
3.4 Predicting Tag Importance
The object tag importance prediction problem is examined in this section. First, we
discuss three feature types used for prediction; namely, object semantic, visual and con-
text cues. Then, we describe a structured tag importance prediction model, in which
inter-dependency between tag importance is characterized by the Markov Random Field
(MRF) [28]. The model parameters are learned using the Structural Support Vector
Machine (SSVM) [97].
3.4.1 Three Feature Types
Semantic Features. As pointed out in [7], some object categories are more attractive to humans than others. For example, given the tags "cat", "chair" and "potted plant" for
an image in Fig. 3.4a, people tend to describe “cat” more often than “chair” and “potted
plant”. Fig. 3.5 shows the statistics of tag importance of all 20 object categories in the
UIUC dataset. It is clear that some object categories (e.g. “aeroplane”, “bird”, “cow”,
etc.) are generally more important than others (e.g. “bottle”, “chair”, “pottedplant”,
etc.). This semantic cue can be modeled as a categorical feature. It is a|C|-dimensional
binary vector with value one in theith index indicating theith object category, where|C|
denotes the number of different object categories. This results in 20-D and 54-D feature
vectors for the UIUC and the COCO datasets, respectively.
Visual Features. Obviously, humans will not consider an object important just
because of its category. As indicated by the red bars in Fig. 3.5, the variance of impor-
tance among different object categories can be large. For example, both images in
Fig. 3.4b have tags “bus” and “person”, but their importance differ because of their
image visual properties. It is clear that the size, location and occlusion information
(i.e. visual properties) play a role in determining objects’ importance. To capture visual
Figure 3.4: Three object importance cues, where the text below each image gives the ground truth tag importance: (a) semantic (cat: 0.8, chair: 0.2, potted plant: 0); (b) visual (left: bus 0.2, car 0, person 0.8; right: bus 0.9, car 0, person 0.1); (c) object context (left: dog 0.6, sofa 0.4; right: sofa 0.8, tv monitor 0.2).
cues, we first apply Faster R-CNN [81] (with an RPN as the object proposal network and Fast R-CNN with VGG16 as the detector network) to extract the bounding boxes corresponding to the object tags, and then calculate the following properties using the detected bounding boxes:
Figure 3.5: The mean and the standard deviation of the ground truth tag importance for each of the 20 object categories in the UIUC dataset.
1) the area and log(area) as the size features; 2) the max-, min- and mean-distances to
the image center, the vertical mid-line, the horizontal mid-line, and the third box [92] as
location features; and 3) relative saliency. For the last item, we use the spectral residual
approach in [41] to generate the saliency map. Even though false detections will affect tag importance prediction, they can be reduced significantly by simply removing proposals whose object categories are not in the tag list of the current image. The performance loss of tag importance prediction caused by object detection errors will be studied in Section 3.6. By concatenating all the above features, we obtain a 15-D visual feature vector. An object tag may correspond to multiple object instances in an image. In this case, we add up the size and saliency features of all object instances to obtain the corresponding tag size and saliency features, but take the minimum value among all related object instances as the tag location feature.
Object Context Features. The object context features are used to characterize how
the importance of an object tag is affected by the importance of other object tags. When
two object tags coexist in an image, their importance is often inter-dependent. Consider
the sample database image in Fig. 3.1. If the “dog” did not appear, the “bicycle” would
be of great importance due to its large size and centered location, and the discounted
probability importance of “bicycle” would be 1 based on Eqn. (3.1). To model this
kind of interdependency, we should consider not only relative visual properties between
two object tags but also their semantic categories. Fig. 3.4c shows two object context
examples. For the left image, although the “sofa” has a larger size and a better location,
people tend to describe the “dog” more often since the “dog” gets more attention. On the
other hand, for the right image, people tend to have no semantical preference between
“sofa” and “TV monitor”. However, due to the larger size of the “sofa”, it is perceived
as more important by human. Fig. 3.6 shows all 6 tag pairs of the exemplary database
image in Fig. 3.1. For the “bicycle” and “dog” pair, their importance are about the same
since the “bicycle” is more visually important while the “dog” is more semantically
important. However, for the “bicycle” and “person” pair, their visual properties are so
different that the “bicycle” is more important, although the “person” is semantically
attractive.
To extract an object context feature, we conduct two tasks: 1) analyze the relative
visual properties within an object tag pair, and 2) identify the tag pair type (i.e. semantic
categories of the two object tags that form a pair). To model the difference of visual properties for a tag pair $(t_i, t_j)$, we use $s_i - s_j$ and $d_i - d_j$ as the relative size and location, respectively, where $s$ and $d$ denote the bounding box area and the corresponding mean distance to the image center, as described in the visual feature section. The final object context feature $g^o_{ij}$ for the tag pair $(t_i, t_j)$ is defined as
$$ g^o_{ij} = \Big[ (s_i - s_j)\, p_{ij}^T \;\; (d_i - d_j)\, p_{ij}^T \Big]^T, \tag{3.2} $$
where $p_{ij}$ is the tag pair type vector for the tag pair $(t_i, t_j)$.
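The context feature of Eq. (3.2) can be assembled as in the sketch below; the pair-type encoding (a one-hot vector over unordered category pairs) and the numeric values are assumptions made for illustration.

# Object context feature g_ij for a tag pair (Eq. 3.2).
import numpy as np

categories = ["dog", "bicycle", "person", "table"]          # illustrative vocabulary
pairs = sorted({tuple(sorted((a, b))) for a in categories for b in categories if a != b})

def context_feature(tag_i, tag_j, s_i, s_j, d_i, d_j):
    p_ij = np.zeros(len(pairs))                              # tag pair type vector
    p_ij[pairs.index(tuple(sorted((tag_i, tag_j))))] = 1.0
    return np.concatenate([(s_i - s_j) * p_ij, (d_i - d_j) * p_ij])

g = context_feature("dog", "bicycle", s_i=0.12, s_j=0.35, d_i=0.20, d_j=0.05)
print(g.shape)                                               # 2 x number of pair types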
Figure 3.6: Object pairs with their visual properties (size, location) and semantic categories in the exemplary database image in Fig. 3.1, where ">", "<" and "≈" denote higher, lower and about equal relative importance between the two objects, respectively.
3.4.2 Tag Importance Prediction Model
The interdependent nature of tag importance in an image naturally defines a structured
prediction problem [76], which can be mathematically formulated using the MRF [28]
model. Each image can be represented as an MRF. Fig. 3.7 shows an exemplary MRF
modelpV;Eq built for the sample database image in Fig. 3.1, where each vertexvP V
represents an object tag and each edgee P E represents a pair of object tags. Specif-
ically, each object tag located in the vertex has its own cues (visual and semantic) to
predict importance while each edge enforces the output to be compatible with the rela-
tive importance between object tags.
Under a log-linear MRF model, the energy function to be minimized can be expressed as
$$ E(X, G, y; w) = -\sum_{i \in V} w_V^T \varphi_V(x_i, y_i) - \sum_{(i,j) \in E} w_E^T \varphi_E(g_{ij}, y_i, y_j), \tag{3.3} $$
Figure 3.7: The MRF model for the exemplary database image in Fig. 3.1.
where $y = \{y_i\}$ is the predicted tag importance output vector, $X = \{x_i\}$ is the concatenation of the visual and semantic feature vectors described in Section 3.4.1, $G = \{g_{ij}\}$ is the set of context feature vectors calculated using Eq. (3.2), and $w = \big[ w_V^T \;\; w_E^T \big]^T$ is the weight vector learned from the training data. $\varphi_V$ and $\varphi_E$ are joint kernel maps [2] defined as
$$ \varphi_V(x_i, y_i) = x_i \otimes \psi(y_i), \tag{3.4} $$
$$ \varphi_E(g_{ij}, y_i, y_j) = g_{ij} \otimes \psi(y_i - y_j), \tag{3.5} $$
where $\otimes$ is the Kronecker product,
$$ \psi(y_i) = \big[ \mathbb{1}\{y_i = 0\}, \; \mathbb{1}\{y_i = 0.1\}, \; \ldots, \; \mathbb{1}\{y_i = 1\} \big], $$
$$ \psi(y_i - y_j) = \big[ \mathbb{1}\{y_i - y_j = -1\}, \; \ldots, \; \mathbb{1}\{y_i - y_j = 1\} \big], $$
and where $\mathbb{1}$ is the indicator function. It is worthwhile to mention that the ground truth tag importance usually takes certain discrete values in experiments, and rounding it to the nearest tenth does not affect the retrieval performance. Thus, the ground truth tag importance is quantized into 11 discrete levels, from 0 to 1 with an interval of 0.1. This leads to an 11-D $\psi(y_i)$ vector and a 21-D $\psi(y_i - y_j)$ vector. The reason for defining $\psi(y_i - y_j)$ in this form is that 1) the relative importance can be quantified, and 2) the dimensionality of the $w_E$ vector is greatly reduced.
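A sketch of the joint kernel maps in Eqs. (3.4)-(3.5) with the 11-level quantization is given below; the feature dimensions and random feature values are illustrative.

# Joint kernel maps phi_V and phi_E (Eqs. 3.4-3.5) with 11 importance levels.
import numpy as np

levels = np.arange(0.0, 1.01, 0.1)                  # 0.0, 0.1, ..., 1.0 (11 values)
diffs  = np.arange(-1.0, 1.01, 0.1)                 # -1.0, ..., 1.0 (21 values)

def psi_node(y_i):
    return (np.abs(levels - y_i) < 1e-6).astype(float)          # 11-D indicator

def psi_edge(y_i, y_j):
    return (np.abs(diffs - (y_i - y_j)) < 1e-6).astype(float)   # 21-D indicator

def phi_V(x_i, y_i):
    return np.kron(x_i, psi_node(y_i))              # x_i (Kronecker) psi(y_i)

def phi_E(g_ij, y_i, y_j):
    return np.kron(g_ij, psi_edge(y_i, y_j))        # g_ij (Kronecker) psi(y_i - y_j)

x_i  = np.random.default_rng(0).standard_normal(35) # visual + semantic features
g_ij = np.random.default_rng(1).standard_normal(12) # context feature (Eq. 3.2)
print(phi_V(x_i, 0.4).shape, phi_E(g_ij, 0.4, 0.6).shape)   # (385,) (252,)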
The above model can be simplified to yield binary-valued tag importance as in [7]. This is achieved by treating $y_i$ as a binary class label and redefining $\psi(y_i)$ and $\psi(y_i - y_j)$. The performance of this simplified model will be reported in the first half of Section 3.6.3.
Learning. To learn the model parameter $w$, one straightforward way is to apply the probabilistic parameter learning approach [76], i.e., treating the energy function in Eq. (3.3) as the negative log likelihood of the data and applying gradient-based optimization. However, this learning approach ignores the ordinal nature of the output importance label. For example, if a tag has ground truth importance value 1, a predicted importance of 0 will be penalized the same as a predicted importance of 0.9, which clearly deviates from intuition. On the other hand, loss-minimizing parameter learning approaches such as the Structural Support Vector Machine (SSVM) [97] allow a customizable loss function for different prediction tasks, which can be exploited by taking the ordinal nature of the output importance label into account. As a result, we use the SSVM to learn the weight vector $w$ and adopt the one-slack-variable, margin-rescaling formulation in [47]. The optimization problem becomes
$$ \min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C\xi \quad \text{s.t.} \quad \forall (\bar{y}_1, \ldots, \bar{y}_N) \in \mathcal{Y}^N: \;\; \frac{1}{N}\sum_{n=1}^{N} w^T \delta\Phi_n(\bar{y}_n) \ge \frac{1}{N}\sum_{n=1}^{N} \Delta(\hat{y}_n, \bar{y}_n) - \xi, \tag{3.6} $$
where $\hat{y}_n$ is the ground truth tag importance vector of the $n$th training image, $\mathcal{Y}^N$ is the set of all possible outputs for the training dataset,
$$ \delta\Phi_n(\bar{y}_n) = \Phi(X_n, G_n, \hat{y}_n) - \Phi(X_n, G_n, \bar{y}_n), $$
and
$$ \Phi(X, G, y) = \bigg[ \Big(\sum_{i \in V} \varphi_V(x_i, y_i)\Big)^T \;\; \Big(\sum_{(i,j) \in E} \varphi_E(g_{ij}, y_i, y_j)\Big)^T \bigg]^T. $$
In the weight learning process, we define the following loss function:
$$ \Delta(\hat{y}, \bar{y}) = \frac{1}{|V|} \sum_{i \in V} \big|\hat{y}_i - \bar{y}_i\big|, \tag{3.7} $$
which is the Mean Absolute Difference (MAD) between the ground truth and the predicted tag importance values of one image. We applied the standard cutting plane algorithm [47] to solve the optimization problem and obtain the final weight vector $w$.
Inference. After learning the weight vector $w$, we can determine the vector $y$ that minimizes Eq. (3.3). Moreover, finding the most violated constraint in the cutting plane training also requires inference. Despite the fully connected graph structure of the MRF model, the number of object tags in an image is usually limited. As a result, even if we try all possible outputs, the computational complexity is still acceptable. Our experimental results show that inference takes only about 0.2 s per image in C on a PC with a 2.4 GHz CPU and 4 GB RAM. For this reason, we adopt the exact inference approach in this work and leave fast inference algorithms for future work.
3.5 Multimodal Image Retrieval
In this section, we discuss our MIR/TIP system by employing CCA/KCCA. First, we
will review the CCA and KCCA. Then we will describe the visual and textual features
used in our MIR/TIP experiments. Note that other learning methods can also be used in
place of CCA/KCCA.
3.5.1 CCA and KCCA
In multimodal image retrieval, an image is associated with both a visual feature vector $f_v$ and a textual feature vector $f_t$ (e.g., the tag vector). Given these feature pairs $(f_v^{(i)}, f_t^{(i)})$ for $N$ images, two design matrices $F_v \in \mathbb{R}^{N \times D_v}$ and $F_t \in \mathbb{R}^{N \times D_t}$ can be generated, where the $i$th rows of $F_v$ and $F_t$ correspond to $f_v^{(i)}$ and $f_t^{(i)}$, respectively. CCA aims at finding a pair of matrices $P_v \in \mathbb{R}^{D_v \times c}$ and $P_t \in \mathbb{R}^{D_t \times c}$ that project the visual and textual features into a common $c$-dimensional subspace with maximal normalized correlation:
$$ \max_{P_v, P_t} \; \mathrm{trace}\big( P_v^T F_v^T F_t P_t \big) \quad \text{s.t.} \quad P_v^T F_v^T F_v P_v = I, \;\; P_t^T F_t^T F_t P_t = I. \tag{3.8} $$
The above optimization problem can be reduced to a generalized eigenvalue problem [37], and the eigenvectors corresponding to the largest $c$ eigenvalues are stacked horizontally to form $P_v$ and $P_t$.
To measure the similarity of the projected features in the subspace and achieve cross-modality retrieval, we adopt the Normalized CCA metric proposed in [32, 33]. After solving the CCA problem in Eq. (3.8), the similarity between visual features $F_v$ and textual features $F_t$ is computed as
$$ \frac{ \big( F_v P_v \,\mathrm{diag}(\lambda_1^t, \ldots, \lambda_c^t) \big) \big( F_t P_t \,\mathrm{diag}(\lambda_1^t, \ldots, \lambda_c^t) \big)^T }{ \big\| F_v P_v \,\mathrm{diag}(\lambda_1^t, \ldots, \lambda_c^t) \big\|_2 \, \big\| F_t P_t \,\mathrm{diag}(\lambda_1^t, \ldots, \lambda_c^t) \big\|_2 }, \tag{3.9} $$
where $\lambda_1, \ldots, \lambda_c$ are the top $c$ eigenvalues and $t$ is the power applied to the eigenvalues (we set $t = 4$ as in [32, 33]).
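A sketch of this similarity computation is given below: given projection matrices, eigenvalues and features (all randomly generated stand-ins here for the learned quantities), it scores each database image against a query; it is used in the same way for the I2I and T2I rankings in Section 3.6.

# Normalized CCA similarity (Eq. 3.9) between a visual query and textual database entries.
import numpy as np

rng = np.random.default_rng(0)
Dv, Dt, c, t = 64, 20, 10, 4
Pv, Pt = rng.standard_normal((Dv, c)), rng.standard_normal((Dt, c))   # assumed learned
lam = np.sort(rng.random(c))[::-1]                                    # assumed top-c eigenvalues
scale = lam ** t

Fv = rng.standard_normal((1, Dv))           # one visual query
Ft = rng.standard_normal((500, Dt))         # textual features of 500 database images

Qv = Fv @ Pv * scale                        # eigenvalue-weighted projections
Qt = Ft @ Pt * scale
Qv /= np.linalg.norm(Qv, axis=1, keepdims=True)
Qt /= np.linalg.norm(Qt, axis=1, keepdims=True)

scores = (Qt @ Qv.T).ravel()                # cosine similarity in the CCA subspace
print(np.argsort(-scores)[:5])              # indices of the top-5 retrieved images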
To model the nonlinear dependency between the visual and textual feature vectors, a pair of nonlinear transforms, $\phi_v$ and $\phi_t$, are used to map the visual and textual features into high dimensional spaces, respectively. With the kernel functions
$$ K_m\big( f_m^{(i)}, f_m^{(j)} \big) = \phi_m\big( f_m^{(i)} \big)^T \phi_m\big( f_m^{(j)} \big), \qquad m \in \{v, t\}, $$
$\phi_v$ and $\phi_t$ are only computed implicitly. This kernel trick leads to KCCA, which attempts to find the maximally correlated subspace within the two transformed spaces [37]. However, since the time and space complexity of KCCA is $O(N^2)$, it is not practically applicable to large scale image retrieval. We thus adopt CCA for the retrieval experiments. We also tried the scalable KCCA proposed in [32] by constructing an approximate kernel mapping, but found almost no performance improvement in our experimental setting.
3.5.2 Retrieval Features
The features used in the CCA for MIR/TIP and the experimental settings are given below.
Visual Features. To capture object properties, we use the VGG16 network trained on ImageNet [53] to extract visual features. The output of FC7 (fully connected layer 7) forms a 4096-D visual feature vector.
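A sketch of this feature extraction step is given below, assuming a PyTorch/torchvision implementation (the dissertation does not state which framework was used); the image path is hypothetical.

# Extract 4096-D FC7 features from an ImageNet-pretrained VGG16 (sketch).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True).eval()   # newer torchvision versions use the weights= argument
fc7 = torch.nn.Sequential(*list(vgg.classifier.children())[:5])   # up to the FC7 ReLU

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc7_feature(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        h = vgg.avgpool(vgg.features(x)).flatten(1)   # convolutional trunk, 25088-D
        return fc7(h).squeeze(0)                      # 4096-D visual feature

# feature = fc7_feature("query.jpg")   # hypothetical image path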
Textual Features. We consider 5 types of textual features, i.e. 1) the tag vector, 2)
the predicted binary-valued tag importance vector, 3) the true binary-valued tag impor-
tance vector, 4) the predicted continuous-valued tag importance vector, and 5) the true
continuous-valued tag importance vector, each of which is used in a retrieval experi-
mental setting as discussed in Section 3.6.
3.6 Experimental Results
In this section, we first discuss the datasets and our experimental settings. Then, we
compare the performance of different tag importance prediction models. Finally, exper-
iments on three retrieval tasks are conducted to demonstrate the superior performance
of the proposed MIR/TIP system.
3.6.1 Datasets
We adopt the UIUC [79] and the full COCO [65] datasets for small and large scale experiments, respectively. Moreover, the UIUC [79] dataset was used for image importance
prediction in [7] and can serve as a convenient benchmarking dataset. Each image in
these two datasets is annotated by 5 sentence descriptions and each object instance in an
image is labeled with a bounding box. The UIUC dataset consists of 1000 images with
objects from 20 different categories. The COCO dataset has in total 123,287 images
with objects from 80 categories. We use object categories as object tags. Thus, UIUC
has approximately 1.8 tags per image while COCO has on average 3.4 tags per image.
We mainly use the UIUC dataset for the importance prediction experiments since it is not a challenging dataset for retrieval.
3.6.2 Retrieval Experiment Settings
As introduced in Section 3.5, different tag features correspond to different MIR experi-
ment settings because the semantic subspace is determined by applying CCA to visual
and textual features. We thus compared the following 5 MIR settings:
Traditional MIR: Textual features are the binary-valued tag vectors. This is the
benchmark method used in [32, 42, 43].
MIR/PBTI: Textual features are Predicted Binary-valued Tag Importance vec-
tors. This corresponds to the predicted importance proposed in [7].
MIR/PCTI: Textual features are Predicted Continuous-valued Tag Importance
vectors. This is our proposed system.
MIR/TBTI: Textual features are True Binary-valued Tag Importance vectors.
This serves as the upper bound for the binary-valued tag importance proposed
in [7].
MIR/TCTI : Textual features are True Continuous-valued Tag Importance vec-
tors. This gives the best retrieval performance, which serves as the performance
bound.
Among the above five systems, the last two are not achievable since they assume the tag
importance prediction to be error free.
Moreover, we evaluate our system in terms of 3 retrieval tasks:
I2I (Image to Image retrieval): Given a query image, the MIR systems will project
the visual features into the CCA subspace and rank the database images according
to Eq. (3.9). We also test a baseline retrieval system (Visual Only) that ranks the
database images using visual features’ Euclidean distance.
T2I (Tag to Image retrieval): Given a tag list, the MIR systems project the tag feature into the CCA subspace and rank the database images according to Eq. (3.9). Note that our system can support a weighted tag list as the query, as in [32], in which the weights represent the importance of the tags.
I2T (Image annotation): Given a query image, the MIR systems find the 50 nearest neighbors in the CCA subspace and use their textual features to generate an average textual feature vector, based on which the tags in the tag vocabulary are ranked. We also test a baseline tagging system that uses deep features to find nearest neighbors and their corresponding tag vectors to rank tags.
For all retrieval tasks, we adopt the Normalized Discounted Cumulative Gain (NDCG) as the performance metric since it is a standard and commonly used metric [42, 43, 58]. Moreover, it helps quantify how well an MIR system performs. The NDCG value for the top $k$ results is defined as
$$ \mathrm{NDCG}@k = \frac{1}{Z} \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i+1)}, $$
where $r_i$ is a relevance index (or function) between the query and the $i$th ranked image, and $Z$ is a query-specific normalization term that ensures the optimal ranking has an NDCG score of 1. The relevance index measures the similarity between the retrieved results and the query in terms of the ground truth continuous-valued tag importance, i.e., whether an MIR system can preserve the important content of the query in the retrieval results or not. For the I2T retrieval task, the relevance of a tag to the query image is set to its ground truth continuous-valued tag importance. For the other two tasks, we define the relevance index as the cosine similarity between the ground truth tag importance vectors of the query and the retrieved images. The choice of the relevance index will be justified by the subjective test results in Sec. 4.4.2.
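For concreteness, a small sketch of the NDCG@k computation used for evaluation is given below; the relevance values are invented, and the ideal ranking of the same relevance list is used as the normalizer Z.

# NDCG@k with the ideal ranking as the normalizer Z.
import numpy as np

def dcg(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    return np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2)))

def ndcg(relevances, k):
    ideal = dcg(sorted(relevances, reverse=True), k)   # DCG of the optimal ranking
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# r_i: cosine similarity between the tag importance vectors of the query and the i-th result.
retrieved_relevance = [0.9, 0.2, 0.7, 0.0, 0.5]
print(round(ndcg(retrieved_relevance, k=5), 3))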
3.6.3 Performance of Tag Importance Prediction
To evaluate tag importance prediction performance, we first compare the performance of
the state-of-the-art method with that of the proposed tag importance prediction method
on UIUC dataset. Then, under continuous-valued tag importance setting, we study the
effect of different feature types on the proposed tag importance prediction model, along
with the loss introduced by binary-valued importance for all datasets introduced in Sec-
tion 3.6.1.
For the purpose of performance benchmarking, we simplified our structured model as discussed in Sec. 3.4.2 to achieve binary-valued tag importance prediction and compared it with [7]. Here we use accuracy as the evaluation metric. As in [7], accuracy is defined as the percentage of correctly classified object instances over the total number of object instances in the test images. For a fair comparison, we also ran 10 simulations of 4-fold cross validation and compared the mean and standard deviation of the estimated
accuracy. The performance comparison results are shown in Table 3.1. The baseline
method simply predicts “yes” (or important) for every object instance while the next
column refers to the best result obtained in the work of Berg et al. [7]. Our simplified
structured model can further improve the prediction accuracy of [7] by 5.6%.
Table 3.1: Performance comparison of tag importance prediction.
Methods Baseline Berg et al. [7] Proposed
Accuracy Mean 69.7% 82.0% 87.6%
Accuracy STD 1.3 0.9 0.7
Next, we evaluate the continuous-valued tag importance prediction performance of 7
different models. Here we use prediction error as the evaluation metric, which is defined
as the average MAD in Eq. (3.7) across all test images. These 7 models are:
1. Equal importance of all tags (called the Baseline);
2. Visual features only (denoted by Visual);
3. Visual and semantic features (denoted by Visual+Semantic);
4. Visual, semantic and context features (denoted by our Model);
5. Visual, semantic and context features with ground truth bounding boxes (denoted
by our Model/True bbox);
6. Equal importance of tags that are mentioned in any sentence. This corresponds
to the true binary-valued tag importance computed as in [7] (denoted by binary
true);
7. Equal importance of tags that are predicted as “important” using the model pro-
posed in [7] (denoted by binary predicted).
For the 2nd and 3rd models, we adopt the ridge regression models trained by visual
features and visual plus semantic features, respectively. The 4th model is the proposed
structured model as described in Section 3.4.2. These 3 models help us understand the
impact of different feature types as described in Sec. 3.4.1. The 5th model differs from
the 4th model in that it uses ground truth bounding boxes to compute the visual features.
It enables us to identify how object detection error will affect the tag importance predic-
tion. For the 6th and 7th models, they are used to quantify how the binary-valued tag
importance (i.e. treating important tags as equally important) results in tag importance
prediction error, which serves as an indicator of performance loss in retrieval. Specifi-
cally, the 6th and 7th models will generate the true and predicted important tags using
the method proposed in [7], respectively. Then, the important tags within the same
image will be treated as equally important and assigned the same continuous-valued
importance. Note that the 6th model is not achievable but only serves as the best case
for the 7th model.
We used 5-fold cross validation to evaluate these prediction models. The prediction
errors for UIUC and COCO datasets are shown in Figs. 3.8(a) and (b), respectively.
These figures show that our proposed structured prediction model (4th) can achieve
approximately 40% performance gain with respect to the baseline (1st). For the COCO dataset, we observe performance gains from all 3 feature types; the visual, semantic, and context features result in approximately 28%, 11%, and 11% prediction error reduction, respectively. By comparing the 4th to the 5th model, we find that the object detection error only results in approximately 3% performance loss. Moreover, it is noted that even the true binary-valued tag importance leads to a non-negligible prediction error by ignoring the relative importance between tags, and this error propagates to the predicted binary-valued tag importance model, resulting in a 48% performance loss relative to our proposed model on the COCO dataset.
The tag importance prediction performance on the UIUC dataset differs from that on COCO in two respects: 1) the visual features result in a 1% performance loss compared with the baseline, and 2) the binary-valued tag importance has less performance loss than on COCO. The above phenomena are caused by the bias of the UIUC dataset, in which 454 out of 1000 images have only one tag, and 415 of them have a tag with importance value 1. This bias makes modeling the relative importance between tags within the same image insignificant. Thus, the baseline and the binary-valued tag importance based models can achieve reasonable performance on the UIUC dataset but not on COCO.
Fig. 3.9 shows an example of how four models perform on a particular image. With-
out considering object context, the Visual and Visual+Semantic models will treat the
importance of “bus” and “car” as independent, and their predicted importance values
are closely related to the sizes of their bounding boxes.
Figure 3.8: Comparison of continuous-valued tag importance prediction errors of the seven models on (a) the UIUC dataset (baseline 0.156, visual 0.158, visual+semantic 0.140, our model 0.100, our model/true bbox 0.087, binary true 0.054, binary predicted 0.135) and (b) the COCO dataset (baseline 0.231, visual 0.173, visual+semantic 0.154, our model 0.134, our model/true bbox 0.130, binary true 0.113, binary predicted 0.192).
Figure 3.9: Results of four importance prediction models on an image with “bus”, “car”,
and “person” tags.
3.6.4 Performance of Multimodal Image Retrieval
In this subsection, we show the retrieval experimental results on the COCO dataset using the settings given in Section 3.6.2. We randomly sampled 10% of the images as queries and used the rest as database images. Among the database images, 50% were used as MIR training images. For the I2I and I2T experiments, we directly used the image as the query. For the T2I experiment, a weighted tag list based on the ground truth tag importance vector was used as the query.
I2I Results. The NDCG curves for the COCO dataset are shown in Fig. 3.10. We have the following observations from the plot. First, we find a significant improvement of all MIR systems over the visual baseline. Deep features can give reasonable performance in retrieving the most similar image, but their performance lags behind the MIR systems as K becomes larger. Second, our proposed MIR/PCTI system exhibits considerable improvements over other practical MIR systems, including Traditional MIR and MIR/PBTI. Specifically, for K = 50 (the typical number of retrieved images a user is willing to browse for one query), MIR/PCTI achieves approximately a 14% gain over the visual baseline, 4% over Traditional MIR, and 2% over MIR/PBTI on the COCO dataset. Moreover, our proposed MIR/PCTI system can even match the upper bound of the MIR system using binary-valued importance, and it only has a 1% performance gap with its own upper bound.
[Figure 3.10 (plot omitted): NDCG@k for k up to 100, comparing Visual baseline, Traditional MIR, MIR/PBTI, MIR/PCTI, MIR/TBTI, and MIR/TCTI.]
Figure 3.10: NDCG curves for image-to-image retrieval. The dashed lines are upper
bounds for importance based MIR systems.
Finally, by relating Fig. 3.10 to Fig. 3.8, we can see that the tag importance prediction performance roughly correlates with the retrieval performance. Thus, better tag importance prediction leads to better I2I retrieval performance.
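As a reference for how these NDCG curves are computed, below is a small sketch of NDCG@k under one common formulation (linear gains, logarithmic discounts, normalization by the ideal ordering of the same list); the exact gain and normalization used in the thesis may differ, and the function name and toy relevance values are ours.

# Minimal NDCG@k sketch (assumed variant of the metric used in this section).
import numpy as np

def ndcg_at_k(relevances, k):
    """relevances: graded relevance of the retrieved items, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # ranks 1..k
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[:ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranked list of relevance scores (e.g., importance-weighted relevance
# between a query and each retrieved image).
print(round(ndcg_at_k([0.9, 0.2, 0.7, 0.0, 0.5], k=5), 3))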
Besides the objective evaluation, some visual retrieval results from the UIUC and COCO datasets are given in Fig. 3.11 and Fig. 3.12, respectively. In these figures, we show the retrieval results of three MIR systems along with the results obtained by using only the visual modality (denoted as Visual only). We see from all four sets of results that, for a given query, the MIR systems retrieve more relevant images than the visual-only method.
Figure 3.11: Top two retrieved results for two exemplary images in the UIUC dataset,
where the four columns represent four retrieval methods: (A) Visual only, (B) Tradi-
tional MIR, (C) MIR/PCTI, and (D) MIR/TCTI.
Furthermore, the results retrieved by MIR/PCTI and MIR/TCTI closely match the query in terms of tag importance.
Consider the first retrieval example of the UIUC dataset in Fig. 3.11, where "bus" and "car" are the tags of the query image. Clearly, the "bus" obtained by Traditional MIR is not as important as that in the query image due to occlusion (the 1st ranked result) or smaller scale (the 2nd ranked result). The same observation applies to the first retrieval example of the COCO dataset in Fig. 3.12, where the query image has two object tags, "airplane" and "person". Apparently, the "airplane" is the more important object in the query image, and it is well preserved in the images retrieved by MIR/PCTI and MIR/TCTI. In contrast, some of Traditional MIR's retrieved results either miss the airplane (the 1st ranked result) or retrieve an unimportant airplane (the 2nd and the 4th ranked results).
These examples illustrate that capturing tag importance plays a significant role in MIR.
Another interesting phenomenon is that MIR/PCTI and MIR/TCTI tend to retrieve images that are more semantically consistent with the query image. This is the case for the second examples in the UIUC and COCO datasets. In particular, to retrieve images similar to the second query in the UIUC dataset, the retrieval method should not only know that both the "person" and the "dining table" are needed in the retrieved images, but also that the "person" is a bit more important than the "dining table". Only with such a detailed level of importance information can the method retrieve "portrait during dinner" images. The same situation applies to the second query image in the COCO dataset, where the "couch" and the "chair" are necessary objects of a "living room" but not dominant ones. Without such information, the retrieved results can easily be dominated by a single object without capturing the whole scene (see the 2nd ranked image of Traditional MIR). More visual results are shown in Fig. 3.13 and Fig. 3.14 for the UIUC and COCO datasets, respectively.
T2I Results. We show the NDCG curves of the T2I results on the COCO dataset in Fig. 3.15. Our proposed MIR/PCTI model shows consistently superior performance over Traditional MIR and MIR/PBTI on both datasets. In particular, for K = 50, MIR/PCTI outperforms Traditional MIR by 4% and MIR/PBTI by 2% on the COCO dataset. Moreover, the proposed MIR/PCTI system only has a 1% performance gap with its upper bound.
I2T Results. The results of tagging on COCO are shown in Fig. 3.16. Again, we observe consistent improvements of MIR/PCTI over Traditional MIR and MIR/PBTI. Specifically, for K = 3 (the typical number of tags each image has in the dataset), MIR/PCTI achieves approximately an 8% gain over the baseline, 5% over Traditional MIR, and 3% over MIR/PBTI on the COCO dataset.
Figure 3.12: Top four retrieved results for two exemplary images in the COCO dataset,
where the four columns represent four retrieval methods: (A) Visual only, (B) Tradi-
tional MIR, (C) MIR/PCTI, and (D) MIR/TCTI.
Figure 3.13: Top two retrieved results for two exemplary images in the UIUC dataset,
where the four columns represent four retrieval methods: (A) Visual only, (B) Tradi-
tional MIR, (C) MIR/PCTI, and (D) MIR/TCTI.
More surprisingly, its performance can even exceed the upper bound of MIR/PBTI and match that of MIR/TCTI. This suggests that our proposed system can not only generate tags but also rank them according to their importance.
3.7 Summary
A multimodal image retrieval scheme based on tag importance prediction (MIR/TIP)
was proposed in this chapter. A discounted probability importance concept was intro-
duced to measure the tag importance from human sentence descriptions and used as the
ground truth. Three types of features (semantic, visual and object context) were iden-
tified and, then, a structured model for tag importance prediction was proposed. It was
Figure 3.14: Top four retrieved results for two exemplary images in the COCO dataset,
where the four columns represent four retrieval methods: (A) Visual only, (B) Tradi-
tional MIR, (C) MIR/PCTI, and (D) MIR/TCTI.
[Figure 3.15 (plot omitted): NDCG@k for k up to 100, comparing Traditional MIR, MIR/PBTI, MIR/PCTI, MIR/TBTI, and MIR/TCTI.]
Figure 3.15: The NDCG curves for the tag-to-image retrieval on the COCO dataset. The
dashed lines are upper bounds for importance based MIR systems.
demonstrated by experimental results that the retrieval performance of the proposed
MIR/TIP method can be improved significantly over traditional multimodal retrieval
methods that treat image tags as equally important.
[Figure 3.16 (plot omitted): NDCG@k for k = 1 to 10, comparing the Baseline, Traditional MIR, MIR/PBTI, MIR/PCTI, MIR/TBTI, and MIR/TCTI.]
Figure 3.16: The NDCG curves for the image-to-tag retrieval on the COCO dataset. The
dashed lines are upper bounds for importance based MIR systems.
Chapter 4
Multimodal Image Retrieval with
Scene Tag Importance Prediction
In this chapter, we extend the MIR/TIP framework introduced in Chapter 3 to include another two critical semantic components of an image: scene tag importance and object relations. We first discuss why these two components are indispensable for a good image retrieval framework. Then, the methodologies for including scene tag importance and object relations are introduced, respectively. The promising experimental results confirm the necessity of these two semantic components in a robust MIR framework.
4.1 Motivation
From the discussion in Chapter 3, we know that considering object tag importance can greatly enhance image retrieval performance, making the ranking results more meaningful to humans. However, in real-world problems, images are so versatile that considering object tag importance alone is far from achieving a perfect retrieval system.
Consider the four images in Figure 4.1, all of which have the object tags "car", "person", and "surfboard". Despite the same object tags, the four images have very different visual content and thus cannot be considered as relevant to one another in retrieval. Clearly, the image in Figure 4.1 (a) is a scene-oriented image, which focuses on the whole "beach" scene structure rather than individual objects. We can even infer that the photographer was trying to compose a scenic picture when taking it.
Figure 4.1: Four exemplar images with object tags "car", "person", and "surfboard". (a) A "beach" scene image with unimportant object tags. (b) An image that is not a "beach" scene. (c) A "beach" image with important "car" and "beach". (d) A "beach" image with important objects.
If we compare the image in Figure 4.1 (a) to (b), we can immediately see that (b) replaces the most important visual component of (a), the "beach" scene, with a "street" scene, even though the individual objects are not important in (b) either. This problem could be solved by adding a scene tag besides the object tags in our MIR framework. However, as can be observed from the images in Figure 4.1 (c) and (d), a scene type indicator alone is not enough to achieve robust retrieval. Although all the object tags and scene tags match exactly, the relative importance between the scene and the objects changes across (a), (c), and (d). While image (a) focuses on the scene content, the content of image (c) can simply be described as "car on beach", indicating roughly equal importance between object and scene. Lastly, the importance of the objects in image (d) clearly overshadows the scene. Hence, images (a), (c), and (d) cannot be considered as relevant to one another in retrieval either.
For these reasons, it is necessary to consider scene tag importance in the MIR framework. Following the convention in [65], we use iconic object images, iconic scene images, and non-iconic images to represent the three classes with different relative importance between scene and objects. However, our definition is slightly different from that in [65].
Iconic object images: The term iconic object images was first proposed in [8]. This type of image clearly focuses on the objects without devoting much visual content to the scene structure. Different from [8], we do not require the image to have only one object. An example of an iconic object image is shown in Figure 4.1 (d).
Iconic scene images: This type of image clearly focuses on the overall scene structure and does not have clearly important objects. An example is shown in Figure 4.1 (a).
Non-iconic images: This type of image can usually be described as "some objects on/in some scene", indicating that both the objects and the scene are necessary to make the image meaningful to humans. An example is shown in Figure 4.1 (b).
To summarize, merely embedding object tag importance information in the MIR framework is not enough to handle the diversity of images. There are more semantics that MIR can preserve when learning the common subspace, such as scene tag importance and object relations. As will be demonstrated in this chapter, these two semantics can bring extra performance gain over the MIR system with object importance only.
4.2 Jointly Measuring Scene and Object Tag Impor-
tance
Measuring scene tag importance, like object tag importance, is crucial since it will serve
as the ground truth for later importance prediction. Following the discussion in Sec-
tion 3.3, we still consider using human sentence descriptions to measure the importance
in order to avoid heavy human labor.
4.2.1 Issues for Measuring Scene Tag Importance
To measure scene tag importance, one straightforward way is to use the discounted probability defined in Eq. (3.1) by treating the scene tag as a special object tag. However, this ad hoc approach will underestimate the scene tag importance for two reasons:
Imbalanced number between scene tag and object tag
Consider the image shown in Figure 4.2, which is an iconic "beach" scene image, along with its sentence descriptions. Treating the scene tag "beach" as a special object tag and applying the discounted probability measurement, we obtain the importance values listed at the bottom of Figure 4.2. In this case, the scene tag "beach" only has an importance value of 0.33, which makes it only as important as "person", since both are mentioned in all five sentences. Clearly, this measurement underestimates the importance of "beach" and deviates considerably from human subjective perception.
To address this problem, we should not treat the scene tag as an object tag when measuring importance. Intuitively, one image can have only one scene tag (we assume here that one image belongs to only one scene type), but it may potentially have many object tags. This creates a prior imbalance between the scene tag and the object tags.
[Figure 4.2 content:
Object tags: person, surfboard, chair, umbrella. Scene tag: beach.
Sentence descriptions:
1. a beach on a sunny day with a bunch of people on it.
2. the sandy beach has people with umbrellas and surfboards.
3. people are enjoying the beach under rainbow umbrellas.
4. a beach with groups of people under colorful umbrellas.
5. people enjoying themselves on at a beach with beach umbrellas and surfboards.
Measured importance: beach 0.33; person 0.33; umbrellas 0.24; surfboard 0.1.]
Figure 4.2: The problem of treating the scene tag as a special object tag and applying the discounted probability to measure tag importance. An iconic "beach" scene image with its five sentence descriptions and the resulting tag importance values.
In other words, even if both the scene tag and the object tags appear in the sentences, the importance of the scene can be discounted to a low value when there are many object tags in the image. This clearly contradicts human intuition.
Failing to consider grammatical role of scene tag in sentence
Moreover, a sentence provides much more information than the mere appearance of tags. In other words, not only the appearance of a tag but also its location and grammatical function affect its importance in a particular sentence. For example, the nouns or verbs that appear in the main sentence structure are usually considered more important than words that appear in modifiers such as prepositional phrases.
[Figure 4.3 content:
Object tags: person, surfboard. Scene tag: beach.
(1) A sandy beach covered in white surfboards near the ocean.
(2) surfboards sitting on the sand of a beach]
Figure 4.3: An example image with two associated sentence descriptions. In sentence (1), the scene tag "beach" appears as the main noun, while "surfboard" appears in the prepositional phrase that modifies "beach". In sentence (2), the scene tag "beach" appears inside the verb phrase that modifies "surfboard".
Consider the image in Figure 4.3. Its associated tags and two sentence descriptions are listed on the right. Clearly, we can identify that the scene tag "beach" is the major constituent of sentence (1) because it appears as the main subject of the whole sentence. On the other hand, "beach" is a minor constituent of sentence (2) because it appears in a modifier phrase of the subject "surfboard".
Even though the above observation is easy for a human to make, it is hard for a computer to understand sentences. In particular, sentence structure parsing is still an active research problem in the Natural Language Processing field. To parse a sentence and infer the tag importance, we first leverage the Stanford Lexicalized Probabilistic Context-Free Grammar Parser [68, 69] to obtain the sentence constituent structure [9], which recursively decomposes the whole sentence into basic units (constituents) in a tree structure. The leaves of the tree correspond to the words in the sentence and their corresponding Part of Speech (POS) tags [101]. The non-leaf nodes represent the so-called constituents, which are the basic grammatical units that compose the whole sentence. The parent-child relations in the tree indicate how the sentence is decomposed into grammatical units until the leaves are reached.
75
To better understand the sentence constituency tree, Figure 4.4 shows the two trees that correspond to the two sentences in Figure 4.3.
1. The whole sentence is decomposed into a noun phrase with subject "beach" and a verb phrase consisting of a verb and a prepositional phrase. The prepositional phrase "in white surfboards" can be further decomposed into the preposition "in" and a noun phrase with head "surfboards". The latter noun phrase can be further decomposed until the tree leaves are reached.
2. The whole sentence is decomposed into a noun phrase with subject "surfboards" and a verb phrase consisting of a verb and a prepositional phrase. Following the same recursive structure as in sentence (1), we can see that the tag "beach" is located at the very bottom of the tree and acts as a modifier of "sand".
After obtaining the sentence constituent tree, we would like to identify whether the scene tag "beach" appears as the subject or as a modifier of the subject. This can be achieved by checking whether it is a descendant of a "PP" node in the tree. Specifically, by traversing from the root to the leaf node "beach", we obtain the following two paths for the two trees, respectively:
1. ROOT → S → NP → NN → "beach"
2. ROOT → S → VP → PP → NP → PP → NP → NN → "beach"
We can see that if the scene tag appears as the grammatical subject, the path will not contain any "PP" node. Moreover, even if a "PP" node appears in the path, the scene tag may still be important. Consider the sentence "a picture of kitchen with many cooking appliances". The root-to-"kitchen" path will have a "PP" node because of the subordinating conjunction "of", so "kitchen" would be treated as a modifier constituent. Therefore, even if we find a "PP" node in the path, we have to check whether its "IN" child is a preposition like "in" or "on", or a subordinating conjunction like "of".
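To make the path-based check concrete, here is a small sketch (ours, not the thesis implementation) that uses NLTK's Tree to walk from the root to the scene-tag leaf of a bracketed constituency parse and test whether it sits under a "PP" headed by a true preposition. The bracketed parse string, the preposition lists, and the function name scene_role are illustrative assumptions.

# Sketch of the root-to-leaf "PP" check described above, assuming a bracketed
# constituency parse (e.g., produced by the Stanford parser) is available.
from nltk.tree import Tree  # pip install nltk

# Hypothetical parse of sentence (2) in Figure 4.3.
PARSE = """(ROOT (S (NP (NNS surfboards))
  (VP (VBP sit)
    (PP (IN on)
      (NP (NP (DT the) (NN sand))
        (PP (IN of) (NP (DT a) (NN beach))))))))"""

PREPOSITIONS = {"in", "on", "at", "near", "by"}   # assumed preposition list
SUBORDINATING = {"of"}                            # "of"-PPs do not demote

def scene_role(parse_str, scene_tag):
    """Return 'subject', 'modifier', or 'absent' for the scene tag."""
    tree = Tree.fromstring(parse_str)
    leaves = [w.lower() for w in tree.leaves()]
    if scene_tag not in leaves:
        return "absent"
    leaf_pos = tree.leaf_treeposition(leaves.index(scene_tag))
    # Inspect every constituent on the root-to-leaf path.
    for depth in range(len(leaf_pos)):
        node = tree[leaf_pos[:depth]]
        if node.label() == "PP":
            head = next((c.leaves()[0].lower() for c in node
                         if c.label() == "IN"), None)
            if head in PREPOSITIONS and head not in SUBORDINATING:
                return "modifier"   # sits inside a true prepositional modifier
    return "subject"

print(scene_role(PARSE, "beach"))   # -> 'modifier'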
[Figure 4.4 (parse trees omitted): (a) Constituent structure for sentence (1). (b) Constituent structure for sentence (2).]
Figure 4.4: The sentence constituent trees of the two sentences in Figure 4.3. The
acronym in the trees are: S (Sentence), NP (Noun Phrase), VP (Verb Phrase), PP
(Preposition Phrase), DT (Determiner), JJ (Adjective), NN (Singular Noun), NNS (Plu-
ral Noun), VBN (Verb, past participle), VBP (Verb, non-3rd person singular present),
IN (Preposition or subordinating conjunction).
4.2.2 Proposed Measuring Method
Based on the above analysis, it is necessary to design a new method that jointly measures the scene and object tag importance. We propose the method described in Algorithm 2 to jointly measure the importance of the object and scene tags in one sentence for one image. The final scene and object tag importance values are obtained by averaging the importance vector I over all of the image's associated sentences.
Algorithm 2 Measuring the object and scene tag importance in one image sentence description
1: Input: set of all object tags T_o, set of object tags mentioned in the sentence T_o^s, current sentence s, scene tag t_s;
2: Set scene factor c_s ← 0;
3: Set sentence tree T ← parseSentence(s);
4: Set path p ← findPath(T, T.root, t_s);
5: if "PP" ∈ p && the "PP" node's "IN" child is a preposition then
6:     c_s ← β;
7: else if p ≠ NULL then
8:     c_s ← α;
9: end if
10: for t ∈ T_o do
11:     I(t) ← 1{t ∈ T_o^s} / (|T_o^s| (1 + c_s));
12: end for
13: I(t_s) ← c_s / (1 + c_s);
14: return I
In Algorithm 2, the parameters α and β (α > β; in our experiments, we set α = 2 and β = 1) represent the weight of the scene tag when it appears as the grammatical subject or as a modifier of the sentence, respectively. This accounts for the grammatical role of the scene tag in the sentence. Moreover, the scene tag weight is effectively multiplied by the number of object tags, which accounts for the imbalance between the single scene tag and the multiple object tags. The resulting formulas for the object and scene tag importance are given in lines 11 and 13 of Algorithm 2. As an example, referring back to Figure 4.2, the importance values of the object and scene tags calculated using Algorithm 2 are shown in Table 4.1:
Table 4.1: Measured tag importance for Figure 4.2 based on proposed method, where
SGR stands for Scene tag Grammar Role
Sentence # SGR beach person umbrella surfboard
1 subject 0.67 0.33 0 0
2 subject 0.67 0.11 0.11 0.11
3 object 0.67 0.17 0.17 0
4 subject 0.67 0.17 0.17 0
5 modifier 0.5 0.17 0.17 0.17
Overall 0.63 0.19 0.12 0.06
Compared with the purely discounted probability based importance shown in Figure 4.2, the importance values calculated above are more consistent with human subjective perception.
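The per-sentence computation in lines 10-13 of Algorithm 2 can be written in a few lines. The sketch below is an assumed illustration (the helper name, the role strings, and the treatment of subject and object roles as a single case are ours, not the thesis code); it reproduces the row for sentence 2 of Figure 4.2.

# Sketch of lines 10-13 of Algorithm 2: per-sentence importance given the
# mentioned object tags and the grammatical role of the scene tag.
ALPHA, BETA = 2.0, 1.0   # scene weights for subject/object vs. modifier roles

def sentence_importance(all_object_tags, mentioned_object_tags,
                        scene_tag, scene_role):
    """scene_role is 'subject', 'modifier', or 'absent'."""
    c_s = {"subject": ALPHA, "modifier": BETA, "absent": 0.0}[scene_role]
    n = max(len(mentioned_object_tags), 1)        # |T_o^s|, avoid divide-by-0
    imp = {t: (1.0 / (n * (1.0 + c_s)) if t in mentioned_object_tags else 0.0)
           for t in all_object_tags}
    imp[scene_tag] = c_s / (1.0 + c_s)
    return imp

# Sentence 2 of Figure 4.2: "the sandy beach has people with umbrellas and
# surfboards" -- the scene tag "beach" is the subject.
print(sentence_importance(
    ["person", "surfboard", "chair", "umbrella"],
    {"person", "umbrella", "surfboard"}, "beach", "subject"))
# -> beach 0.67; person, umbrella, surfboard 0.11 each; chair 0.0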
4.3 Jointly Predicting Scene and Object Tag Impor-
tance
In this section, we extend the object tag importance prediction model proposed in Chapter 3 to account for the scene tag importance at the same time. We first describe the three types of features used to predict the scene tag importance. We then extend our structured model to jointly predict the object and scene tag importance. Finally, we propose a Mixed Integer Programming (MIP) formulation of the inference problem, which can be solved efficiently using sophisticated optimization algorithms such as the branch-and-bound method.
4.3.1 Features for Predicting Scene Tag Importance
The three types of features proposed for predicting object tag importance also apply to the scene tag.
[Figure 4.5 (bar chart omitted): the mean and standard deviation of importance for six scene categories (bathroom, beach, kitchen, living room, mountain, street).]
Figure 4.5: The mean and the standard deviation of the ground truth tag importance for
6 scene categories in the COCO dataset.
Semantic Features. As discussed in Sec. 3.4.1, the semantic cues serve as human prior knowledge about the importance of a specific object category. This applies to the scene tag as well. To demonstrate this, we plot the mean and standard deviation of the importance values for six scene categories of the COCO dataset in Figure 4.5. The six scene categories are chosen to be both versatile and representative. They include three indoor scenes ("bathroom", "kitchen" and "living room") and three outdoor scenes. The outdoor scenes can be further classified into two natural scenes ("beach" and "mountain") and one man-made scene ("street"). In general, we find that indoor scene images are more likely to focus on the overall scene structure rather than the individual objects within. This is especially true for "bathroom" and "kitchen".
The reason for this difference is the composite nature of each scene type. Consider the two images shown in Figure 4.6 (a). People tend to describe these two images as iconic scene images even though there are many objects within them. This is because each object in the image can be considered a necessary component of the whole scene. For example, a room without a "toilet" cannot be considered a "bathroom".
In other words, it is common sense that a "bathroom" should contain a "toilet". When all the component objects are combined to form the scene, people tend to mention the scene as a whole concept rather than each individual object.
On the other hand, the necessary components of an outdoor scene are usually "stuff" such as "sky", "ocean", "sand", "grass", "road", etc., rather than solid "objects". In other words, additional object tags such as "surfboard", "car", and "umbrella" provide extra information about the whole image and are more likely to be mentioned along with the overall scene type.
For the above reasons, it is necessary to encode the scene tag category as a feature for importance prediction. As with the semantic feature for object tags, we encode the scene type as a categorical feature.
Visual Features. Unlike an object tag, which has size, location, and saliency as strong indicators of importance, a scene is a relatively abstract concept with no obvious visual feature to indicate its importance. In [7], Berg et al. used a human-annotated "scene strength" (ranging from 1 to 5) as a feature, along with the scene type, to predict whether the scene tag will be mentioned in the sentence. This "scene strength" is essentially a visual feature, but it is provided directly by humans.
Figure 4.6 (b) shows one iconic "beach" scene image and one iconic object image with scene "beach". Clearly, one desired property of an iconic outdoor scene image is the "openness" of the general scene type. Moreover, from the discussion of the semantic features, we know that certain stuff components are important for an image to be considered a particular type of scene. For the "beach" scene, the stuff components are "sky", "ocean", and "sand", which usually have distinctive color properties. The balanced composition of these three types of stuff components strongly affects whether humans consider the image an iconic scene. As shown in the right image of Figure 4.6, all three stuff components of the "beach" scene are significantly less dominant than in the left image due to occlusion and the camera close-up, resulting in a lower scene tag importance.
[Figure 4.6 (images omitted), with the ground truth importance values listed below each image:
(a) Semantic: bathroom 0.9, sink 0, toilet 0; kitchen 1.0, microwave 0, sink 0, oven 0, remote 0, chair 0.
(b) Visual: beach 0.8, person 0.8, dog 0, bench 0; beach 0.2, cow 0.6, boat 0.3.
(c) Context: living room 0.3, clock 0, tv 0, chair 0, dining table 0, couch 0.1, person 0.5; living room 0.8, clock 0, chair 0, dining table 0, couch 0.1, cat 0.2; living room 0.9, laptop 0, vase 0, chair 0, dining table 0, couch 0.1, cell phone 0.]
Figure 4.6: Three cues for predicting scene tag importance. The ground truth importance values are shown below each image.
We therefore adopt the FC7 (Fully Connected layer 7) features extracted using a VGG16 trained on the Places dataset [115] to model this scene property.
Context Features. Based on the definitions of scene tag importance and object tag importance introduced in Sec. 4.2, it is intuitive that scene tag importance and object tag importance are interdependent.
To give an example, consider the three images shown in Figure 4.6 (c) with scene type "living room". The right image can be considered a classical "living room" scene image, and all the objects within it can be considered object components of the overall scene structure. In the middle image, the object "cat" only takes away a little of the importance of the "living room" due to its semantic interestingness but relatively small size. Finally, in the leftmost image, the persons become the dominant objects due to their prominent size and location, and the importance of the "living room" is suppressed by the "person". Clearly, if we removed the "person" in the left image and the "cat" in the middle image, the "living room" in the three images would no doubt be equally important. This confirms the assumption that scene and object tag importance are interdependent.
To model the context cues between scene and object tag importance, we use an approach similar to that described in Sec. 3.4.1. We define a tag pair type vector p_{is} to indicate the object and scene tags that form the edge (i, s). It is a categorical vector that models the semantics of the tag pair. Moreover, as discussed above, the size of the objects also affects the interaction between the scene and the objects. We thus define the context feature as g^s_{is} = s_i p_{is}, where s_i denotes the sum of the sizes of the i-th object tag's bounding boxes.
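As a concrete illustration of g^s_{is}, the sketch below builds a one-hot tag-pair vector and scales it by the normalized total bounding-box area of the object tag. The vocabulary sizes, the normalization by image area, and the function name are assumptions for illustration, not the exact encoding used in the thesis.

# Sketch of the scene-object context feature g^s_{is} = s_i * p_{is}.
import numpy as np

NUM_OBJ, NUM_SCENE = 80, 30            # assumed vocabulary sizes

def context_feature(obj_idx, scene_idx, obj_box_areas, image_area):
    """obj_box_areas: areas of all boxes of this object tag in the image."""
    s_i = sum(obj_box_areas) / image_area          # normalized total size
    p_is = np.zeros(NUM_OBJ * NUM_SCENE)           # categorical pair vector
    p_is[obj_idx * NUM_SCENE + scene_idx] = 1.0
    return s_i * p_is                              # g^s_{is}

g = context_feature(obj_idx=2, scene_idx=5,
                    obj_box_areas=[4000.0, 2500.0], image_area=640 * 480)
print(g.nonzero()[0], g[g.nonzero()])              # index of the pair, weight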
[Figure 4.7 (diagram omitted): a sample image with object tags "car", "dog", and "potted plant" and scene tag "street"; edges connect object tag pairs and scene/object tag pairs.]
Figure 4.7: A sample image and its corresponding joint MRF model.
4.3.2 Joint Structured Model
We extend the structured model proposed in Sec. 3.4.2 to jointly predict object and scene tag importance. Figure 4.7 shows an example joint MRF model for a sample image. It can be observed that a scene tag node has been added to the original structured model, and each object tag node is connected to the scene tag node. More formally, each image is represented by an MRF model (V, E), where V = V_o ∪ {v_s}. Here V_o denotes the set of object tags of the current image and v_s denotes the current scene tag. The edge set is E = E_o ∪ E_{os}, where E_o = {(v_i, v_j) : v_i, v_j ∈ V_o} is the edge set for object tag pairs and E_{os} = {(v_i, v_s) : v_i ∈ V_o} is the edge set for object-scene tag pairs.
The energy function of the joint MRF model can be written as Eq. (4.1).
E(X_o, x_s, G_o, G_s, \mathbf{y}; \mathbf{w}) =
  \underbrace{\sum_{i \in V_o} \mathbf{w}_{V_o}^T \varphi_{V_o}(x_i, y_i)}_{\text{object tag visual \& semantic}}
  + \underbrace{\sum_{(i,j) \in E_o} \mathbf{w}_{E_o}^T \varphi_{E_o}(g^o_{ij}, y_i, y_j)}_{\text{object tag pair context}}
  + \underbrace{\mathbf{w}_{V_s}^T \varphi_{V_s}(x_s, y_s)}_{\text{scene tag visual \& semantic}}
  + \underbrace{\sum_{(i,s) \in E_{os}} \mathbf{w}_{E_{os}}^T \varphi_{E_{os}}(g^s_{is}, y_i, y_s)}_{\text{object-scene tag pair context}},
  \qquad (4.1)
which can be decomposed into four parts. The first two parts are the same as those in Eq. (3.3). The third term corresponds to the scene tag visual and semantic features. The last term models the context feature between the object and scene tags.
The weight vector \mathbf{w} = [\mathbf{w}_{V_o}^T, \mathbf{w}_{E_o}^T, \mathbf{w}_{V_s}^T, \mathbf{w}_{E_{os}}^T]^T is learned from the training data. The joint kernel maps \varphi_{V_o} and \varphi_{V_s} take the same form as in Eq. (3.4), while \varphi_{E_o} and \varphi_{E_{os}} take the form in Eq. (3.5).
In the learning stage, the same SSVM formulation as in Eq. (3.6) can be formed, except that the joint feature map \Psi is now a function of X_o, x_s, G_o, G_s, \mathbf{y} and can be written as:
\Psi(X_o, x_s, G_o, G_s, \mathbf{y}) =
\begin{bmatrix}
\sum_{i \in V_o} \varphi_{V_o}(x_i, y_i) \\
\sum_{(i,j) \in E_o} \varphi_{E_o}(g^o_{ij}, y_i, y_j) \\
\varphi_{V_s}(x_s, y_s) \\
\sum_{(i,s) \in E_{os}} \varphi_{E_{os}}(g^s_{is}, y_i, y_s)
\end{bmatrix}.
It can easily be verified that \mathbf{w}^T \Psi(X_o, x_s, G_o, G_s, \mathbf{y}) = E(X_o, x_s, G_o, G_s, \mathbf{y}; \mathbf{w}). We still use the MAD as the loss function in the optimization process of the SSVM.
4.3.3 Inference: a Linear Integer Programming Formulation
As discussed in Chapter 2, inference is an important step even in the learning process, and its efficiency can be the bottleneck of the training speed. In this section, we convert our inference problem into a linear integer programming problem [87], which can then be fed into an off-the-shelf optimizer to speed up inference.
Observing that the objective function in Eq. (4.1) is a function of \mathbf{y} during inference, we define two sets of auxiliary binary variables, y_{ik} and d_{ijl}. Here the variable set \{y_{ik} : i \in V, k = 0, \ldots, 10\} represents whether the i-th node takes the importance value k/10, and the variable set \{d_{ijl} : i, j \in V, j > i, l = -10, \ldots, 10\} represents whether the importance values of the i-th and j-th nodes differ by l/10. Moreover,
the weight vector can be further decomposed. For example, we can write the weight vector w_{V_o} = [\mathbf{w}_{V_o,0}^T, \ldots, \mathbf{w}_{V_o,10}^T]^T. Thus, the first term of Eq. (4.1) can be written as a function of y_{ik}:
\mathbf{w}_{V_o}^T \varphi_{V_o}(x_i, y_i) = \sum_{k=0}^{10} \mathbf{w}_{V_o,k}^T x_i \, y_{ik}.
Letting c_{ik} = \mathbf{w}_{V_o,k}^T x_i, we have \mathbf{w}_{V_o}^T \varphi_{V_o}(x_i, y_i) = \sum_{k=0}^{10} c_{ik} y_{ik}. Applying the same change of variables to the remaining three terms, the objective function can be rewritten as:
f(\{y_{ik}\}, \{d_{ijl}\}) = \sum_{i \in V} \sum_{k=0}^{10} c_{ik} y_{ik} + \sum_{(i,j) \in E_o} \sum_{l=-10}^{10} c_{ijl} d_{ijl} + \sum_{(i,s) \in E_{os}} \sum_{l=-10}^{10} c_{isl} d_{isl}
\qquad (4.2)
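The change of variables above hinges on the joint map placing the node feature in the weight block of its quantized importance level, so that each unary score reduces to c_ik = w_{V_o,k}^T x_i. The toy sketch below (assumed dimensions and random weights, not the thesis code) illustrates this reduction.

# Toy sketch of the unary term in Eq. (4.1)/(4.2).
import numpy as np

D, LEVELS = 4, 11                        # feature dim, importance levels 0..10
rng = np.random.default_rng(0)
w_blocks = rng.normal(size=(LEVELS, D))  # w_{V_o,0}, ..., w_{V_o,10}

def unary_score(x, y):
    """w_{V_o}^T phi_{V_o}(x, y) for importance y in {0, 0.1, ..., 1.0}."""
    k = int(round(10 * y))
    return float(w_blocks[k] @ x)        # equals c_ik for this node and level

x_i = rng.normal(size=D)
print([round(unary_score(x_i, k / 10), 3) for k in range(LEVELS)])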
Besides the objective function, we still need to satisfy a number of constraints, such as \sum_{k=0}^{10} y_{ik} = 1 and \sum_{l=-10}^{10} d_{ijl} = 1. We also need to ensure that d_{ijl} = 1 if and only if y_i - y_j = l. To achieve this, we define extra auxiliary integer variables d_{ij} = (y_i - y_j). The final linear integer programming formulation of the inference problem becomes:
\begin{aligned}
\max_{y_{ik},\, d_{ijl}} \quad & \sum_{i \in V} \sum_{k=0}^{10} c_{ik} y_{ik} + \sum_{(i,j) \in E_o} \sum_{l=-10}^{10} c_{ijl} d_{ijl} + \sum_{(i,s) \in E_{os}} \sum_{l=-10}^{10} c_{isl} d_{isl} \\
\text{s.t.} \quad & \sum_{k=0}^{10} y_{ik} = 1, \quad \forall i \in V \\
& \sum_{l=-10}^{10} d_{ijl} = 1, \quad \forall (i,j) \in E \\
& \sum_{k=0}^{10} k\, y_{ik} = y_i, \quad \forall i \in V \\
& \sum_{l=-10}^{10} l\, d_{ijl} = d_{ij}, \quad \forall (i,j) \in E \\
& d_{ij} = y_i - y_j, \quad \forall (i,j) \in E \\
& \sum_{i \in V} y_i \leq 10 \\
& y_{ik} \in \{0, 1\},\; d_{ijl} \in \{0, 1\}, \quad \forall i \in V,\; \forall (i,j) \in E \\
& y_i \in \{0, \ldots, 10\},\; d_{ij} \in \{-10, \ldots, 10\}, \quad \forall i \in V,\; \forall (i,j) \in E
\end{aligned}
\qquad (4.3)
where the first two constraints ensure that each y_i and each d_{ij} take exactly one value, the 3rd and 4th constraints enforce the relationship between the auxiliary binary variables and the integer variables, and the 6th constraint ensures that the total importance of one image does not exceed 1. The formulation in Eq. (4.3) could be further converted into a binary integer program, but at the cost of introducing many more variables, which is not a good trade-off for inference efficiency.
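For illustration, the sketch below builds the ILP of Eq. (4.3) with PuLP and its bundled CBC solver. This is an assumed choice of modeling library (the thesis only states that a generic optimizer can be used), and the node/edge sets and score coefficients are toy values standing in for the c_ik and c_ijl computed from the learned weights.

# Sketch of the inference ILP in Eq. (4.3) using PuLP (assumed solver choice).
import pulp

nodes = [0, 1, 2]                       # object tag nodes plus the scene node
E = [(0, 1), (0, 2), (1, 2)]            # E_o and E_os merged for brevity
K = list(range(11))                     # importance levels k = 0..10
L = list(range(-10, 11))                # level differences l = -10..10
c_node = {i: {k: 0.01 * (i + 1) * k for k in K} for i in nodes}       # toy c_ik
c_edge = {e: {l: -0.001 * abs(l) for l in L} for e in range(len(E))}  # toy c_ijl

prob = pulp.LpProblem("importance_inference", pulp.LpMaximize)
y_bin = pulp.LpVariable.dicts("y", (nodes, K), cat="Binary")
d_bin = pulp.LpVariable.dicts("d", (list(range(len(E))), L), cat="Binary")
y_int = pulp.LpVariable.dicts("yi", nodes, 0, 10, cat="Integer")
d_int = pulp.LpVariable.dicts("dij", list(range(len(E))), -10, 10, cat="Integer")

# Objective: unary plus pairwise scores.
prob += (pulp.lpSum(c_node[i][k] * y_bin[i][k] for i in nodes for k in K)
         + pulp.lpSum(c_edge[e][l] * d_bin[e][l]
                      for e in range(len(E)) for l in L))

for i in nodes:
    prob += pulp.lpSum(y_bin[i][k] for k in K) == 1          # one level per node
    prob += pulp.lpSum(k * y_bin[i][k] for k in K) == y_int[i]
for e, (i, j) in enumerate(E):
    prob += pulp.lpSum(d_bin[e][l] for l in L) == 1          # one difference
    prob += pulp.lpSum(l * d_bin[e][l] for l in L) == d_int[e]
    prob += d_int[e] == y_int[i] - y_int[j]
prob += pulp.lpSum(y_int[i] for i in nodes) <= 10            # total importance <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print([int(y_int[i].value()) for i in nodes])                # importance * 10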
4.4 Experimental Results
In this section, we first discuss the datasets and our experimental settings. We proceed to
present our subjective test results to justify that the ground truth tag importance based on
descriptive sentences is consistent with human perception. Then, we compare the per-
formance of different tag importance prediction models. Finally, experiments on three
retrieval tasks are conducted to demonstrate the superior performance of the proposed
MIR/TIP system.
4.4.1 Dataset
To test the proposed system, we need datasets that have many annotated data avail-
able, including sentence descriptions, object tags, object bounding boxes and scene tags.
Table 4.2 lists the profiles of several image datasets in the public domain. Among them,
the UIUC, COCO, and VisualGenome datasets appear to meet our need the most since
they have descriptive sentences for each image. However, the Visual Genome dataset
aims to be an “open vocabulary” dataset with 80,138 object categories but on average
only 49 instances for each object category. This poses certain difficulties in training
CCA subspace and Faster R-CNN due to the large tag vocabulary size and limited train-
ing instances per category respectively.
Consequently, to test our proposed full system with scene tag, we enrich COCO with
scene tags and call it COCO Scene dataset. Specifically, we first identified 30 common
scene types in the COCO dataset. Then, 50 human workers were invited to manually
classify 60,000 images randomly drawn from the COCO dataset into one of 31 groups,
which include 30 scene types mentioned above and an extra group indicated as “Not
sure/None of above”. This is necessary as the COCO dataset contains a large amount
of object centric images, whose scene types are hard to identify even for human. This
[Figure 4.8 (bar chart omitted): the total number of images, on a log scale, for each of the 30 scene categories (street, living room, beach, bathroom, kitchen, wild_field, snow_field, pasture, bedroom, restaurant, baseball_field, ocean, tennis_court, railway, home_office, sky, plaza, hotel, highway, market, parking lot, forest, yard, harbor, football_field, office, tower, dining room, airport_terminal, bakery_shop).]
Figure 4.8: The image number of each scene category in the COCO Scene dataset.
resulted in a dataset consisting of 25,124 images with a tag vocabulary of 110. It has on
average 4.3 tags per image. We will refer to this dataset as the COCO Scene dataset for
the rest of this section.
Table 4.2: Comparison of major image datasets.
Dataset              Sentences   Object Tag   Bounding Box   Scene Tag
UIUC [79]            Yes         Yes          Yes            No
COCO [65]            Yes         Yes          Yes            No
SUN2012 [105]        No          Yes          Yes            Yes
LabelMe [84]         No          Yes          Yes            No
ImageNet [20]        No          Yes          Yes            No
Visual Genome [52]   Yes         Yes          Yes            No
Figure 4.8 and Figure 4.9 show the statistics of scene and object categories for the
COCO Scene dataset.
[Figure 4.9 (bar chart omitted): the total number of images, on a log scale, for each of the 80 object categories, from "person" (most frequent) down to "hair drier" (least frequent).]
Figure 4.9: The image number of each object category in the COCO Scene dataset.
4.4.2 Subjective Test Performance of Measured Tag Importance
Since ground truth continuous-valued tag importance is used to measure the degree of
object/scene importance in an image as perceived by a human, it is desired to design a
subjective test to evaluate its usefulness. Here, we would like to evaluate it by checking
how much it helps boost the retrieval performance. Specifically, we compare the performance of two different relevance functions and see whether the defined ground truth continuous-valued tag importance correlates better with human experience. The two relevance functions are given below.
1. The relevance function with the measured ground truth tag importance:
r_g(p, q) = \frac{\langle \mathbf{I}_p, \mathbf{I}_q \rangle}{\| \mathbf{I}_p \| \, \| \mathbf{I}_q \|}, \qquad (4.4)
where \mathbf{I}_k denotes the ground truth continuous-valued importance vector for image k (k = p or q).
2. The relevance function with binary-valued importance [7] (i.e., whether a tag appears in the sentences or not):
r_b(p, q) = \frac{\langle \mathbf{t}_p, \mathbf{t}_q \rangle}{\| \mathbf{t}_p \| \, \| \mathbf{t}_q \|}, \qquad (4.5)
where \mathbf{t}_k denotes the binary-valued tag importance vector for image k (k = p or q). A small numerical sketch of these two relevance functions is given below.
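The following toy computation (made-up importance vectors over a three-tag vocabulary) shows how r_g can separate two candidate images that r_b scores identically.

# Toy comparison of r_g (Eq. (4.4)) and r_b (Eq. (4.5)).
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vocabulary: [beach, person, surfboard]. Query p is scene-centric; candidate
# q1 is scene-centric as well, while candidate q2 is person-centric.
I_p, I_q1, I_q2 = [0.7, 0.2, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]
t_p, t_q1, t_q2 = [1, 1, 1], [1, 1, 1], [1, 1, 1]   # binary: all tags mentioned

print(round(cosine(I_p, I_q1), 2), round(cosine(I_p, I_q2), 2))  # r_g differs
print(round(cosine(t_p, t_q1), 2), round(cosine(t_p, t_q2), 2))  # r_b: both 1.0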
In the experiment, we randomly selected 500 image queries from the COCO Scene
dataset and obtained the top two retrieved results with the max relevance scores using
two relevance functions mentioned above. In the subjective test, we presented the two
retrieved results of the same query in pairs to the subject, and asked him/her to choose
the better one among the two. We invited five subjects (one female and four males with
their ages between 25 and 30) to take the test. There were 1500 pairwise comparisons in
total. We randomized the order of two relevance functions in the GUI to minimize the
bias. Moreover, each subject viewed each query at most once. We made the following
observation from the experiment. Compared with the results using the relevance function with binary-valued importance r_b, the results using our relevance function r_g were favored 1176 times (out of 1500, or 78.4%). This indicates that the relevance function with ground truth importance r_g does help improve the retrieval performance, and it also demonstrates the validity of the proposed methodology for extracting sentence-based ground truth tag importance.
4.4.3 Performance of Tag Importance Prediction
In this part, we will evaluate how our importance prediction model performs on both
object tag and scene tag importance prediction. Again, the MAD is used as the eval-
uation metric, and the error is estimated by conducting 5-fold cross validation over all
dataset images.
[Figure 4.10 (bar chart omitted): the average tag error, average object tag error, and average scene tag error for the seven models (baseline, visual, visual+semantic, our model, our model/true bbox, binary true, binary predicted).]
Figure 4.10: Comparison of continuous-valued tag importance prediction errors of seven
models.
We evaluated seven different prediction models (the same as in Sec. 3.6.3) to demonstrate the performance of our tag importance prediction module. The prediction error bars for all settings are shown in Figure 4.10, where the baseline simply treats each tag within one image as equally important. Moreover, we plot three error bars: for the object tags only, for the scene tag only, and for the overall error.
The figure shows that our proposed structured prediction model (the 4th) achieves approximately a 40% performance gain with respect to the baseline (the 1st). For the COCO Scene dataset, we observe performance gains from all three feature types: the visual, semantic, and context features yield approximately 28%, 11%, and 11% prediction error reduction, respectively. By comparing the 4th to the 5th model, we find that the object detection error only results in approximately 3% performance loss. Moreover, it is noted that even the true binary-valued tag importance leads to a non-negligible prediction error by ignoring the relative importance between tags, and this error propagates to the predicted binary-valued tag importance model, resulting in a 45% performance loss over our proposed model on the COCO Scene dataset. Lastly, it is observed that scene tag importance is more difficult to predict than object tag importance. Thus, the overall error (averaged over both object and scene tag importance) is higher than the average error of the object tag importance but lower than that of the scene tag importance.
4.4.4 Performance of Multimodal Image Retrieval
In this subsection, we show the retrieval experimental results on the COCO Scene dataset. All settings are the same as those in Sec. 3.6.4.
I2I Results. The NDCG curves for the COCO Scene dataset are shown in Figure 4.11. We find a significant improvement of all MIR systems over the visual baseline. Deep features can give reasonable performance in retrieving the most similar image, but their performance lags behind the MIR systems as K becomes larger. Second, our proposed MIR/PCTI system exhibits considerable improvements over other practical MIR systems, including Traditional MIR and MIR/PBTI. Specifically, for K = 50 (the typical number of retrieved images a user is willing to browse for one query), MIR/PCTI achieves approximately a 10% gain over the visual baseline, 2% over Traditional MIR, and 2% over MIR/PBTI on the COCO Scene dataset. Moreover, our proposed MIR/PCTI system can even match the upper bound of the MIR system using binary-valued importance, and it only has a 2% performance gap with its own upper bound. Finally, by relating Figure 4.11 to Figure 4.10, we can see that the tag importance prediction performance roughly correlates with the retrieval performance. Thus, better tag importance prediction leads to better I2I retrieval performance. Some qualitative I2I retrieval results are shown in Figure 4.12. Generally speaking, our proposed system captures the overall semantics of the queries more accurately, such as "person playing wii in the living room" for the 1st query
[Figure 4.11 (plot omitted): NDCG@k for k up to 100, comparing the Baseline, Traditional MIR, MIR/PBTI, MIR/PCTI, MIR/TBTI, and MIR/TCTI.]
Figure 4.11: The NDCG curves for the image-to-image retrieval on the COCO Scene
dataset. The dashed lines are upper bounds for importance based MIR systems.
and "person playing frisbee in yard area" for the 2nd query, while the remaining three systems fail to preserve some important objects such as the "remote" or the "frisbee".
T2I Results. We show the NDCG curves of the T2I results on the COCO Scene dataset in Figure 4.13. Our proposed MIR/PCTI model shows consistently superior performance over Traditional MIR and MIR/PBTI. In particular, for K = 50, MIR/PCTI outperforms Traditional MIR by 11% and MIR/PBTI by 5% on the COCO Scene dataset. Moreover, the proposed MIR/PCTI system only has a 2% performance gap with its upper bound. Figure 4.14 shows two qualitative T2I retrieval results, where the two input queries consist of the same tag pair but have different
Figure 4.12: Top three I2I retrieved results for two exemplary queries, where the four
columns show four retrieval systems: (A) Visual Baseline, (B) Traditional MIR, (C)
MIR/PBTI, and (D) MIR/PCTI.
focus. It is observed that our proposed system can correctly retrieve scene-centric or object-centric images as indicated by the importance values, while the other two systems cannot.
I2T Results. The tagging results on the COCO Scene dataset are shown in Figure 4.15. Again, we observe consistent improvements of MIR/PCTI over Traditional MIR and MIR/PBTI. Specifically, for K = 3 (the typical number of tags each image has in the dataset),
[Figure 4.13 (plot omitted): NDCG@k for k up to 100, comparing Traditional MIR, MIR/PBTI, MIR/PCTI, MIR/TBTI, and MIR/TCTI.]
Figure 4.13: The NDCG curves for the tag-to-image retrieval on the COCO Scene
dataset. The dashed lines are upper bounds for importance based MIR systems.
MIR/PCTI achieves approximately a 13% gain over the baseline, 10% over Traditional MIR, and 5% over MIR/PBTI on the COCO Scene dataset. More surprisingly, its performance can even exceed the upper bound of MIR/PBTI and match that of MIR/TCTI. This suggests that our proposed system can not only generate tags but also rank them according to their importance. Sample qualitative tagging results are shown in Figure 4.16, where we can see that our proposed model ranks more important tags ("cow" and "cat") ahead of unimportant or wrong ones ("person" and "bathroom").
[Figure 4.14 (images omitted): the two T2I queries are (cow: 0.8, pasture: 0.2) and (cow: 0.3, pasture: 0.7).]
Figure 4.14: Tag-to-Image retrieval results for two exemplary queries with different focus, where the three columns correspond to the top retrieved results of three MIR systems: (A) Traditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI.
4.5 More Qualitative Results
Finally, more qualitative retrieval results for all 3 tasks are shown in Figure 4.17 to
Figure 4.30.
[Figure 4.15 (plot omitted): NDCG@k for k = 1 to 10, comparing the Baseline, Traditional MIR, MIR/PBTI, MIR/PCTI, MIR/TBTI, and MIR/TCTI.]
Figure 4.15: The NDCG curves for auto ranked tag list generation on the COCO Scene
dataset. The dashed lines are upper bounds for importance based MIR systems.
4.6 Summary
Scene is an important component of images. In this chapter, we proposed to add scene tag importance to our MIR framework. We first discussed the issues in measuring scene tag importance alongside object tag importance and proposed a sentence constituent parsing tree based method to measure scene tag importance. We then extended the structured model proposed in Chapter 3 to jointly predict scene and object tag importance. Thorough experimental results confirmed that the proposed MIR/PCTI system can greatly improve the robustness of the MIR framework.
[Figure 4.16 (images omitted). Example 1 tags: (A) pasture, cow, horse; (B) pasture, cow, person; (C) pasture, cow, person; (D) cow, pasture, bird. Example 2 tags: (A) bathroom, toilet, cat; (B) bathroom, cat, toilet; (C) bathroom, cat, toilet; (D) cat, toilet, bathroom.]
Figure 4.16: Tagging results for two exemplary images, where the four columns corre-
spond to the top three ranked tags of four MIR systems: (A) Baseline, (B) Traditional
MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
Figure 4.17: I2I Query 1. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
Figure 4.18: I2I Query 2. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
Figure 4.19: I2I Query 3. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
Figure 4.20: I2I Query 4. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
Figure 4.21: I2I Query 5. The four columns show four retrieval systems: (A) Visual
Baseline, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
[T2I query 1 (images omitted): beach 0.1, person 0.5, surfboard 0.4.]
Figure 4.22: T2I Query 1. The three columns show three retrieval systems: (A) Traditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI.
[T2I query 2 (images omitted): beach 0.7, person 0.1, surfboard 0.2.]
Figure 4.23: T2I Query 2. The three columns show three retrieval systems: (A) Traditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI.
[T2I query 3 (images omitted): oven 0.8, kitchen 0.2.]
Figure 4.24: T2I Query 3. The three columns show three retrieval systems: (A) Traditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI.
[T2I query 4 (images omitted): baseball field 0.8, person 0.2.]
Figure 4.25: T2I Query 4. The three columns show three retrieval systems: (A) Traditional MIR, (B) MIR/PBTI, and (C) MIR/PCTI.
[Figure 4.26 tag lists: (A) yard, dog, frisbee; (B) yard, dog, frisbee; (C) yard, dog, frisbee; (D) dog, frisbee, yard.]
Figure 4.26: Tagging result 1. The four columns show four retrieval systems: (A) Base-
line, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
[Figure 4.27 tag lists: (A) beach, person, umbrella; (B) beach, person, horse; (C) beach, horse, person; (D) horse, beach, person.]
Figure 4.27: Tagging result 2. The four columns show four retrieval systems: (A) Base-
line, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
[Figure 4.28 tag lists: (A) person, beach, kite; (B) person, beach, kite; (C) beach, person, kite; (D) beach, kite, person.]
Figure 4.28: Tagging result 3. The four columns show four retrieval systems: (A) Base-
line, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
[Figure 4.29 tag lists: (A) wild_field, elephant, bird; (B) wild_field, elephant, bird; (C) wild_field, elephant, bird; (D) elephant, wild_field, bird.]
Figure 4.29: Tagging result 4. The four columns show four retrieval systems: (A) Base-
line, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
[Figure 4.30 tag lists: (A) bathroom, cat, toilet; (B) bathroom, cat, sink; (C) bathroom, cat, sink; (D) cat, sink, bathroom.]
Figure 4.30: Tagging result 5. The four columns show four retrieval systems: (A) Base-
line, (B) Traditional MIR, (C) MIR/PBTI, and (D) MIR/PCTI.
Chapter 5
Improving Object Classification via
Confusing Categories Study
5.1 Motivation
Object classification is a long-standing problem in the computer vision field, and it serves as the foundation for other problems such as object detection [30, 31, 38, 81], scene classification [62, 108], and image annotation [61]. In recent years, deep Convolutional Neural Networks (CNNs) have achieved a significant performance gain over traditional methods due to the availability of large-scale training data and better optimization procedures. However, their performance is still not as robust as that of humans, and their practical usage in highly demanding tasks remains to be explored.
As the number of object categories continues to increase, it is inevitable that certain categories become more confusing than others due to the proximity of their samples in the feature space. For example, while the classes "husky" and "malamute" both belong to the coarse category "dog", "grasshopper" and "cricket" are sub-categories of "insect". Intuitively, it is much easier to distinguish "dog" from "insect" but harder to tell "husky" from "malamute" or "grasshopper" from "cricket". This type of semantic confusion is also reflected in the classification results of deep CNNs. The importance of untangling confusing object categories is illustrated in Figure 5.1, in which the top row shows two groups of images from the same category, "horned rattlesnake".
[Figure 5.1 (images omitted): the confusion set contains thunder snake, hognose snake, king snake, night snake, boa constrictor, rock python, Indian cobra, horned viper, diamondback, and horned rattlesnake.]
Figure 5.1: An illustration of subsets within a confusion set that contains multiple object
categories (e.g. thunder snake, king snake, etc.) as shown in the bottom-left part of
the figure. The top row: two groups of images under the same “horned rattlesnake”
category. The bottom row: two subsets in a confusion set that contains the “horned
rattlesnake” category. The two subsets of images encircled by the green box are visually
similar, thus leading to the confusion set.
While the left group focuses on the serpentine shape of a "snake", the right group features its curly posture. The lower row of Figure 5.1 shows two subsets of a confusion set that contains multiple confusing categories, including "horned rattlesnake". Clearly, a common visual property is observed in the upper-left and lower-left image groups, which explains why the "horned rattlesnake" is likely to be confused with other categories in this confusion set. Therefore, identifying subsets within a confusion set plays a key role in improving its classification performance. The capability of today's CNNs in distinguishing confusing categories is still limited.
In this chapter, we target improving the classification performance for images in a confusion set. To achieve this goal, we have to address two issues: 1) how to identify and group confusing object categories automatically using the properties of the CNN, and 2) how to boost the classification performance within a confusion set. To address the first issue, we conduct a detailed analysis of confusing categories to identify the underlying reason why CNNs confuse certain categories with others. Rather than learning a hierarchical structure based on the CNN confusion matrix, we propose a clustering approach that automatically groups confusing object categories into confusion sets specific to a pre-trained network. To handle the second issue, we adopt a binary-tree-structured (BTS) clustering method to split a confusion set into multiple subsets. A classifier is subsequently learned within each subset to capture its unique discriminative features and enhance its classification performance. These two procedures form the core of the proposed Confusing Categories Identification and Resolution (CCIR) scheme. Experimental results on the ImageNet ILSVRC2012 dataset show that the proposed CCIR scheme can offer a significant performance gain over the AlexNet and the VGG16.
Our work differs from previous work in several aspects. First, instead of exploiting the hierarchy directly for performance improvement, we draw the link between category hierarchy and category confusion. In particular, we conduct an insightful analysis of why a CNN performs poorly on certain classes. Second, to boost classification performance, previous work either adopted a pre-defined hierarchy [18, 29] or learned the hierarchy from the data or the confusion matrix [66, 107]. Here, inspired by the new mathematical model of CNNs in [56], we adopt a new method that identifies confusion sets directly from the CNN without computing a confusion matrix on held-out data. This is feasible because the CNN itself is learned from the training data, which should contain sufficient information for confusion analysis. We will show that, compared with our method, the hierarchy derived from the confusion matrix on randomly sampled data is more susceptible to data bias.
[Figure 5.2 (diagram omitted): the pipeline consists of a baseline CNN, a Confusing Categories Identification step based on anchor vectors, and a Confusing Categories Resolution step with binary-tree-structured clustering into subsets and mixed subset classifiers that produce the final prediction. Example categories: A1 tusker, A2 Indian elephant, A3 African elephant, B1 Yorkshire terrier, B2 silky terrier, C1 Eskimo dog, C2 malamute, C3 Siberian husky, D1 cucumber, D2 zucchini.]
Figure 5.2: An overview of the proposed CCIR system. It consists of 3 modules: 1) a
baseline CNN, 2) a Confusing Categories Identification (CCI) module, and 3) a Confus-
ing Categories Resolution (CCR) module.
Third, we investigate visual confusion subsets within semantically confusing categories, which has not been done before. To solve this problem, instead of performing network surgery to obtain a new network structure as done in [18, 107], we propose a clustering-based confusion set enhancement solution. It achieves a comparable performance improvement based on the outputs of baseline CNNs without CNN re-training.
5.2 Proposed CCIR Scheme
5.2.1 System Overview
A high-level overview of the proposed confusing categories identification and resolution (CCIR) scheme is given in Figure 5.2. The system consists of a baseline CNN classifier trained on the ImageNet dataset and two additional modules, namely, a Confusing
Categories Identification (CCI) module and a Confusing Categories Resolution (CCR)
module. The system can adopt any CNN classifier for the ImageNet dataset as a baseline
method, e.g. the AlexNet or the VGG16.
In the CCI module, the anchor vectors of the baseline CNN model for different cat-
egories are used to build a confusion graph. Particularly, we will show in Sec. 5.2.2 that
the angle between anchor vectors of two categories indicates their degree of confusion.
Based on this observation, we can obtain the confusion graph, where each node denotes
an object category and the edge denotes their degree of confusion. A subgraph of con-
fusion graph is shown in the CCI module in Figure 5.2, where we use a thick, a thin
and no edge to approximately indicate the degree of confusion between different object
categories. Finally, a graph-based clustering method divides the confusion graph into
cliques, based on which confusion sets are generated.
In the CCR module, a BTS clustering is applied to the training images in the confu-
sion set to obtain subsets, which correspond to the leaf nodes of hierarchy. While some
subsets contain images from different categories (see the 1st and the 4th subsets in Fig-
ure 5.2), others contain images from the same class (see the 2nd and the 3rd subsets in
Figure 5.2). They are called mixed and pure subsets, respectively. A specific classifier is
trained within each mixed subset based on the training images. To classify a test image,
its feature vector from the last FC layer of the CNN is used to predict its category and confusion set, and the image then traverses the built binary clustering tree from the root to a leaf node. Then, if the subset is a mixed one, the trained subset classifier will be used to predict its final label. Otherwise, the subset is a pure one, and the category label of the subset will be output as
the final prediction. Details of the CCI and CCR modules will be given in the following
two sections.
5.2.2 Confusing Categories Identification
The goal of the CCI module is to find a many-to-one mapping $g: \mathcal{K} \mapsto \mathcal{S}$ from the object category set $\mathcal{K} = \{1, \ldots, K\}$ to the set of confusion-set indices $\mathcal{S} = \{1, \ldots, S\}$. The confusion set $\mathcal{K}_s$ can be written as
$$\mathcal{K}_s = \{k \mid g(k) = s \text{ for } k \in \mathcal{K}\}. \qquad (5.1)$$
As discussed in Sec. 5.1, the baseline CNN should carry enough information to reveal the confusion between categories, as it is learned from the training data.
Since the convolution and inner product operations can be viewed as signal correlation or projection, they essentially measure the similarity between the input signal and a reference signal, which consists of the weights of a filter and is thus called the anchor vector in [56]. Mathematically, let $\{w_i\}_{i=1}^{K}$ be the learned weights (anchor vectors) of the $K$ classes in the last FC layer before the softmax output. Suppose $x$ is the input feature vector (e.g., the FC7 feature of the AlexNet) of an image. The CNN will predict it as class $j$ if the inner product between the input feature vector $x$ and the anchor vector $w_j$ is maximized. Mathematically, we have
$$j = \arg\max_{i \in \mathcal{K}} w_i^T x = \arg\max_{i \in \mathcal{K}} \|w_i\| \|x\| \cos \angle(w_i, x) = \arg\max_{i \in \mathcal{K}} \|w_i\| \cos \angle(w_i, x), \qquad (5.2)$$
where $\angle(w_i, x)$ is the angle between the anchor vector $w_i$ of the $i$th category and the input feature vector $x$. The second equality holds as $\|x\|$ is a constant when calculating the inner product for the $K$ different categories. Moreover, as discussed in [51], within-layer weight normalization during training leads to better classification performance. This indicates that $\|w_i\|$ will be approximately the same for all $K$ classes if the network is trained properly. Based on this assumption, Eq. (5.2) can be approximated as
$$j = \arg\min_{i \in \mathcal{K}} \angle(w_i, x). \qquad (5.3)$$
This equation has an intuitive interpretation. That is, for images belonging to the $i$th category, their input features $x$ are typically aligned with the anchor vector $w_i$ in a high-dimensional feature space. Intuitively, if the angle between two categories' anchor vectors is small, it is more likely that their CNN output probabilities are similar, leading to confusion. This relationship is shown in Figure 5.3. Consequently, it is reasonable to use the angle between categories' anchor vectors to measure the degree of confusion.
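To make this concrete, the minimal sketch below shows one way the pairwise anchor-vector angles could be computed with NumPy from the last-FC-layer weight matrix. It is an illustration written for this discussion, not code from the dissertation; the variable names (W, angles) and the example dimensions are placeholders.

import numpy as np

def anchor_vector_angles(W):
    """Pairwise angles between anchor vectors.

    W: (K, d) array whose i-th row is the anchor vector w_i, i.e. the
    last-FC-layer weights of the i-th category.
    Returns a (K, K) array of angles in radians.
    """
    # Normalize each anchor vector to unit length.
    W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)
    # Cosine of the angle between every pair of anchor vectors.
    cos = np.clip(W_unit @ W_unit.T, -1.0, 1.0)
    return np.arccos(cos)

# Example with random weights standing in for a trained network.
W = np.random.randn(1000, 4096)      # shape of FC8 weights in an AlexNet-like model
angles = anchor_vector_angles(W)     # a small angle indicates a high degree of confusion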
The above analysis lays the foundation of the CCI module. In this module, we first construct a pairwise connected graph $G = (V, E)$ as shown in Figure 5.4, where the nodes $\{v_i\}_{i=1}^{K}$ denote the object categories in the dataset and the weight of an edge $e_{ij}$ measures the degree of confusion between the $i$th and $j$th categories. We define an affinity matrix $A$ whose $(i,j)$th element $a_{ij}$ is computed from the angle $\angle(w_i, w_j)$, where $w_i$ is the anchor vector of the $i$th class, such that a smaller angle (i.e., a higher degree of confusion) yields a larger affinity. Furthermore, we choose a cut-off threshold $\tau$ and set $a_{ij}$ to zero if $a_{ij} \le \tau$ to simplify the affinity matrix. Then, the spectral clustering technique [74] can be recursively applied to $A$ and its sub-matrices so as to partition the graph into cliques $\{V_s\}$ until $|V_s| < T$ for all $s$. That is, the number of nodes within the same clique is smaller than a threshold $T$. This is needed to avoid a very large confusion set (e.g., a group containing 40 categories of different dogs), which leads to inferior classification performance based on our empirical observations. Finally, categories within the same clique are mapped to the same confusion set. That is, we have
$$g(k) = s \quad \text{if } v_k \in V_s.$$
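As a rough illustration of this recursive partitioning (a sketch under the assumptions reconstructed above, not the dissertation's implementation), the following code uses scikit-learn's spectral clustering on a non-negative, thresholded affinity matrix and keeps splitting until every clique has fewer than T categories. The function name, the default value of T, and the guard for degenerate splits are all assumptions made for illustration.

import numpy as np
from sklearn.cluster import SpectralClustering

def confusion_sets(A, T=10):
    """Recursively bi-partition the confusion graph until every clique
    has fewer than T nodes. A is a (K, K) symmetric non-negative affinity
    matrix (larger value = more confusable). Returns a list of index arrays,
    one per confusion set."""
    def split(idx):
        if len(idx) < T:
            return [idx]
        sub = A[np.ix_(idx, idx)]
        labels = SpectralClustering(
            n_clusters=2, affinity='precomputed',
            assign_labels='discretize', random_state=0).fit_predict(sub)
        # Guard against a degenerate split that keeps everything on one side.
        if labels.min() == labels.max():
            return [idx]
        return split(idx[labels == 0]) + split(idx[labels == 1])

    return split(np.arange(A.shape[0]))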
[Figure 5.3: Illustration of the projections of an image feature onto various anchor vectors. The input image comes from the category "horned rattlesnake", and its feature projections onto the anchor vectors of snake-related categories have similar values.]
5.2.3 Confusing Categories Resolution
Both the training and the testing procedures of the CCR module will be described in this
section.
Training. As described in Sec. 5.2, the first step in the CCR module is to generate subsets from a confusion set using the BTS clustering. Let the features (e.g., the FC7 features of the AlexNet) and labels of all images be denoted as $\{(x_i, y_i)\}_i$, and all training images in the same confusion set as $X_s = \{(x, y) \mid g(y) = s\}$. The goal of the BTS clustering is to generate a clustering tree $t_s$ for each confusion set, where each tree node in $t_s$ is a set of images and each leaf node represents a subset of visually similar images.
[Figure 5.4: A subgraph of the constructed confusion graph, where each node represents an object category and the width of the edge between two nodes indicates their degree of confusion.]
Algorithm 3 shows the recursive process of the BTS clustering. Specifically, the split of a parent node can be conducted by any specific clustering algorithm with $K = 2$; in our experiments, we used spectral clustering. The split process is controlled by the Intra-Cluster-Variance, as indicated by lines 6 and 11 in Algorithm 3, where $\lambda$ is a hyper-parameter used to control the threshold. The Intra-Cluster-Variance $\sigma_l$ is defined as
$$\sigma_l = \frac{1}{|\mathcal{K}_s|} \sum_{k \in \mathcal{K}_s} \sigma_l^k, \qquad (5.4)$$
Algorithm 3 BTS Clustering
1: Input: confusion set training images $X = \{X_s\}_{s=1}^{S}$; threshold parameter $\lambda$
2: for $s = 1 : S$ do
3:   Set $\mathrm{root}(t_s) = X_s$
4:   repeat
5:     for all leaves $l \in t_s$ do
6:       if $\sigma_l > \lambda\,\sigma_{X_s}$ then
7:         $l_1, l_2 = \mathrm{Clustering}(l, K = 2)$
8:         Set $l_1$ and $l_2$ as children of $l$
9:       end if
10:    end for
11:  until $\sigma_l \le \lambda\,\sigma_{X_s}$ for all leaves $l \in t_s$
12: end for
13: return $t_s$ for all $s \in \mathcal{S}$
where $\mathcal{K}_s$ is defined in Eq. (5.1) and $\sigma_l^k$ is defined as the average angle between the image feature vectors and the anchor vector:
$$\sigma_l^k = \frac{1}{|X_l^k|} \sum_{x \in X_l^k} \angle(w_k, x), \qquad (5.5)$$
where $X_l^k = \{x \mid x \in l \text{ and } y = k\}$ denotes the set of images belonging to subset $l$ and category $k$ at the same time. The Intra-Cluster-Variance essentially measures the scatteredness of the images belonging to a particular class within a set of images. Intuitively, if the images of a category within the same subset vary considerably, this subset needs further splitting.
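For concreteness, the sketch below shows how the quantities in Eqs. (5.4) and (5.5) could be computed. It follows the angle-based notation reconstructed above and is not taken from the dissertation's code; treating categories with no images in a node as contributing zero is an assumption made here for illustration.

import numpy as np

def angle(w, x):
    """Angle between an anchor vector w and a feature vector x."""
    cos = np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def intra_cluster_variance(X, y, W, categories):
    """Eq. (5.4): average over the confusion-set categories of the mean
    angle (Eq. (5.5)) between each image feature and its class anchor vector.

    X: (N, d) features of the images in one tree node (subset l)
    y: (N,) labels of those images
    W: (K, d) anchor vectors; categories: the labels in the confusion set K_s
    """
    per_class = []
    for k in categories:
        X_k = X[y == k]
        if len(X_k) == 0:
            per_class.append(0.0)  # assumption: empty classes contribute zero
        else:
            per_class.append(np.mean([angle(W[k], x) for x in X_k]))
    return float(np.mean(per_class))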
Two exemplary BTS trees are shown in Figure 5.5, where the left one is a balanced tree while the right one is not. Moreover, we observe that certain subsets (encircled by a dashed-line bounding box) include images from the same object category, while others do not. These pure subsets are characterized by the discriminative features of a unique category. On the other hand, mixed subsets contain confusing images from multiple
[Figure 5.5: Illustration of trees for two confusion sets obtained by the BTS clustering. They include 3 and 2 object categories, respectively. Pure subsets are encircled by dashed-line bounding boxes.]
categories, and a new classifier is needed to distinguish them. Here, we choose a random forest classifier trained on the same feature vectors of these images.
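A minimal sketch of this training step is given below, assuming scikit-learn's RandomForestClassifier and FC7 features; the helper name, the leaf representation as (X, y) pairs, and the number of trees are placeholders, not the dissertation's actual settings.

from sklearn.ensemble import RandomForestClassifier

def train_mixed_subset_classifiers(leaves, n_estimators=200):
    """Train one random forest per mixed leaf subset.

    leaves: list of (X, y) pairs, one per leaf of a BTS tree, where X holds
    the FC7 features of the images in that leaf and y their labels.
    Returns {leaf_index: fitted classifier} for mixed leaves only.
    """
    classifiers = {}
    for i, (X, y) in enumerate(leaves):
        if len(set(y)) > 1:  # mixed subset: more than one category present
            rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
            classifiers[i] = rf.fit(X, y)
        # Pure subsets need no classifier; their single label is the answer.
    return classifiers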
Testing. To test an image, we map its baseline CNN predicted label to the corre-
sponding confusion set. Given the BTS tree of this confusion set, the test image will
traverse from the root to a specific leaf node (or a subset) using Algorithm 4. If the
test image arrives at a pure subset, the category label of that subset will be output as the
final classification label. Otherwise, the learned random forest classifier will be used to
predict its category.
Algorithm 4 Assigning a test image to a specific subset
1: Input: test image feature $x$, BTS tree $t_s$
2: Initialize $c$ to the root of $t_s$
3: while $c$ has children $c_1$ and $c_2$ do
4:   $c = \arg\min_{l \in \{c_1, c_2\}} \angle\!\left(x, \frac{1}{|l|} \sum_{z \in l} z\right)$
5: end while
6: return $c$ as the designated subset
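The traversal in Algorithm 4 can be sketched as follows. This is an illustration only: it assumes each tree node object exposes a children list (empty for a leaf) and a mean attribute holding the average feature of its images, which are placeholder names introduced here.

import numpy as np

def assign_to_subset(x, node):
    """Route a test feature x down a BTS tree (Algorithm 4 sketch)."""
    def ang(a, b):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    # Descend to the child whose mean feature has the smallest angle with x.
    while node.children:
        node = min(node.children, key=lambda c: ang(x, c.mean))
    return node  # a leaf, i.e. the designated subset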
5.3 Experiments
5.3.1 Experimental Settings
Dataset. We adopted the ImageNet ILSVRC2012 dataset for our experiments as it is widely used to evaluate object classification systems. It consists of approximately 1.3 million training images and 50k validation images from 1000 categories, which include a mix of internal and leaf nodes of the WordNet hierarchy. While ILSVRC2012 does provide 100k testing images, their labels were not released to the public. Consequently, we directly treat the validation images as the testing images and report performance on them, which is a standard procedure in the literature [39, 53, 89, 95, 107]. For the training images, we randomly split the images in each category in a proportion of 25:1, where the latter portion is used as hold-out data to fine-tune the hyper-parameters of our system (such as the thresholds $\tau$ and $\lambda$, and the random forest classifiers' parameters).
Networks and Features. We tested the proposed system on top of two networks, the AlexNet and the VGG16. Specifically, the filter weights of their FC8 layers are used as the anchor vectors in our system. For the binary-tree-structured clustering and the mixed subset classifiers, we adopted the FC7 features of the images output by the networks.
Evaluation Protocols. To compare the classification performance of our system with that of the original networks, we adopted both the single-view and dense evaluation protocols proposed in [88].
1. Single-view protocol (S). The image is first rescaled to a proper size (227x227 for the AlexNet and 224x224 for the VGG16) and then fed into the network;
2. Dense protocol (D). The image is first rescaled isotropically so that the smallest image side is 256. Then, the network is converted into a fully-convolutional one and a probability map is generated for each test image. The category with the largest average probability is output as the predicted label.
It is worthwhile to point out that, when adopting the dense protocol to evaluate our proposed system, we used the average of the FC7 features from all crops of an image to train the mixed subset classifiers.
5.3.2 Overall Performance
The overall performance of the proposed CCIR system on the AlexNet and the VGG16 is shown in Table 5.1. The evaluation was conducted under both single-view and dense
protocols. We see that the proposed CCIR system can achieve considerable error reduc-
tion for both networks. Under the single-view protocol, the CCIR can achieve 2.35%
and 1.99% in top-1 error reduction for the AlexNet and the VGG16, respectively. Under
the dense protocol, the performance gains drop from 2.35% to 2.10% and from 1.99% to
1.85%, respectively. This is reasonable since the dense protocol essentially offers a sim-
ple way to remove the ambiguity among confusing categories. The performance gain
with respect to the top-5 error is comparatively smaller. This is understandable since
our confusion sets usually contain fewer than 5 object categories and images assigned to
them have less impact on the top-5 error.
In addition to the overall performance, we show improvements on some confusion
sets on which the CCIR scheme performs exceptionally well in Table 5.2 and Table 5.3.
Method              Top-1, Top-5 Error (%)
AlexNet (S)         43.97, 20.61
AlexNet+CCIR (S)    41.62, 19.85
AlexNet (D)         42.10, 19.12
AlexNet+CCIR (D)    40.00, 18.52
VGG16 (S)           34.23, 13.35
VGG16+CCIR (S)      32.24, 12.84
VGG16 (D)           27.11, 8.92
VGG16+CCIR (D)      25.26, 8.47
Table 5.1: The overall error rates for the ImageNet validation data.
Again, we see that the confusion sets with the best top-1 and top-5 performance dif-
fer significantly. While the confusion sets with the best top-1 performance are small, the
ones with the best top-5 are larger with more categories. The main reason why smaller
confusion sets tend to have insignificant top-5 performance gain is that, if a test image
is routed to an incorrect confusion set, this error is difficult to correct since the true category's probability remains the same as in the CNN output. To alleviate this problem, it is possible to assign a test image to multiple confusion sets depending on the top-5 CNN output categories, and compute a weighted average of the final predictions of all confusion sets to obtain the CCIR prediction. This direction will be explored in our future
work.
Figure 5.6 shows 3 test images where the CCIR scheme successfully corrects the mistakes made by the VGG16. In all 3 cases, the correct label ranks second highest in terms of probability. However, by using the specific subset classifier
trained for the mixed subset, the CCIR scheme can push their probabilities to the top.
Confusion set (WordNet IDs and categories)                                         Top-1 Error (%): VGG16 / Ours
n03642806 laptop, n03832673 notebook                                               74.00 / 58.00
n03773504 missile, n04008634 projectile                                            57.00 / 40.00
n02808440 bathtub, n04493381 tub                                                   62.00 / 51.00
n12144580 corn, n13133613 ear                                                      55.00 / 41.00
n01871265 tusker, n02504013 Indian elephant, n02504458 African elephant            37.33 / 27.33
n03085013 computer keyboard, n04264628 space bar, n04505470 typewriter keyboard    47.33 / 36.00
n02412080 ram, n02415577 bighorn                                                   38.00 / 28.00
n02109961 Eskimo dog, n02110063 malamute, n02110185 Siberian husky                 44.67 / 34.00
Table 5.2: A list of confusion sets for which the proposed CCIR scheme has significant top-1 error reduction on top of the VGG16 under the dense protocol.
Confusion set (WordNet IDs and categories)                                                                                                      Top-5 Error (%): VGG16 / Ours
n01978287 Dungeness crab, n01978455 rock crab, n01981276 king crab, n01983481 American lobster, n01984695 spiny lobster, n01985128 crayfish     9.37 / 8.00
n02841315 binoculars, n03657121 lens cap, n03976467 Polaroid camera, n04069434 reflex camera                                                   15.50 / 13.50
Table 5.3: A list of confusion sets for which the proposed CCIR scheme has significant top-5 error reduction on top of the VGG16 under the dense protocol.
[Figure 5.6: Case studies on the ImageNet dataset, where each row represents a testing case. Column (a): the test image with its ground truth label (ox, barbell, sidewinder). Column (b): top-5 guesses from the VGG16 under the dense protocol. Column (c): top-5 guesses from the VGG16+CCIR under the dense protocol.]
5.3.3 Evaluating Confusing Categories Identification
To evaluate the performance of the proposed anchor-vector-based CCI method, we conducted experiments to study its properties from different aspects.
Robustness. As discussed in Sec. 5.2.2, the proposed CCI method is based on the anchor vectors of the baseline network, which are independent of the test images. Thus, it is more robust than methods using the confusion matrix obtained from the hold-out data
Method              Top-1, Top-5 error (STD) (%)
AlexNet+CCIR/CM     41.54 (1.03), 19.04 (0.66)
AlexNet+CCIR/AV     40.87 (0.28), 18.70 (0.13)
VGG16+CCIR/CM       26.58 (0.74), 9.03 (0.25)
VGG16+CCIR/AV       25.73 (0.22), 8.77 (0.14)
Table 5.4: Comparison of CCI methods under the dense protocol, where CM means that the affinity matrix is generated based on the confusion matrix and AV means that it is generated based on the anchor vectors.
[107]. To study the effect of the different methods on the final performance, we conducted experiments by replacing our anchor-vector-based method with the one relying on the confusion matrix [107], while keeping the other modules in the CCIR method the same. Specifically, the affinity matrix is calculated as $A = \frac{1}{2}(F + F^T)$, where $F$ is the confusion matrix. Moreover, we divide the training data of each category into 13 folds, where 12 folds are used to train the network (and obtain the anchor vectors) and the remaining one is used to calculate the confusion matrix. This process is repeated 13 times with different folds of data, and the average performance and standard deviation are given in Table 5.4. We see from the table that both the top-1 and top-5 errors increase if the affinity matrix is generated based on the confusion matrix rather than the anchor vectors. While the top-1 and top-5 error rates increase by 0.67% and 0.34% for the AlexNet, they increase by 0.85% and 0.26% for the VGG16. Moreover, the standard deviation of the performance of the anchor-vector-based method is much lower than that of the confusion-matrix-based one. These results suggest that the confusion-matrix-based affinity measure is very susceptible to data bias. In contrast, the angle between the anchor vectors of the baseline CNN is independent of the hold-out data, leading to more robust performance.
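A small sketch of the confusion-matrix-based baseline affinity used in this comparison is given below, assuming scikit-learn's confusion_matrix on held-out predictions; the function name and the zeroed diagonal are assumptions made here for illustration rather than details taken from the dissertation.

import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_matrix_affinity(y_true, y_pred, num_classes):
    """Baseline affinity A = (F + F^T) / 2, where F is the confusion matrix
    computed from held-out predictions."""
    F = confusion_matrix(y_true, y_pred, labels=np.arange(num_classes)).astype(float)
    A = 0.5 * (F + F.T)
    np.fill_diagonal(A, 0.0)  # assumption: ignore self-confusion for graph building
    return A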
Method          AV        Top-1, Top-5 error (%)
AlexNet+CCIR    AlexNet   40.00, 18.52
AlexNet+CCIR    VGG16     40.10, 18.58
VGG16+CCIR      AlexNet   26.03, 8.71
VGG16+CCIR      VGG16     25.26, 8.47
Table 5.5: Confusion hierarchy generalization performance under the dense protocol. The "AV" column indicates the anchor vectors used in generating the hierarchy.
Confusion hierarchy transferability. As anchor vectors are network specific, it is
interesting to investigate how the anchor-vector-based confusion hierarchy of a partic-
ular network performs on a different one. To study this effect, we conducted experi-
ments that used the anchor vectors of AlexNet to obtain the confusion hierarchy and
then applied the CCR module to the VGG16 network, and vice versa. The classifica-
tion performance with the swapped confusion hierarchy is shown in Table 5.5 along
with that of the original setting. We observe an interesting phenomenon from the table.
Specifically, using AlexNet’s confusion hierarchy for VGG16-based CCIR resulted in
0.77% and 0.24% top-1 and top-5 loss. On the other hand, using the confusion hierar-
chy of VGG16 for AlexNet-based CCIR resulted in much less loss (0.10% for top-1 and
0.06% for top-5). This is because the VGG16 is a more powerful network compared to the AlexNet, and the hierarchy based on its anchor vectors is closer to the true
semantic confusion. On the other hand, the hierarchy based on the AlexNet’s anchor
vectors is more specific to the network itself.
5.3.4 Evaluating Confusing Categories Resolution
As described in Sec. 5.2.3, the CCR module consists of two submodules: BTS Cluster-
ing and the mixed subset classifier. We will evaluate these two sub-modules below.
Method Top-1, Top-5 error (%)
AlexNet+CCIR/woBTS 40.65, 18.74
AlexNet+CCIR/BTS 40.00, 18.52
VGG16+CCIR/woBTS 25.68, 8.61
VGG16+CCIR/BTS 25.26, 8.47
Table 5.6: Performance of CCIR with (/BTS) and without (/woBTS) BTS Clustering
under the dense protocol.
BTS Clustering. The BTS clustering is used to group similar images within a con-
fusion set, and it forces the subset classifier to find the more subtle discriminative differences
between confusing categories. To demonstrate its usefulness, we tested the performance
of a simple baseline, which trains a random forest classifier for each confusion set with
all of its training images. We compare the performance of this method with that of
the full CCIR system in Table 5.6. Clearly, the removal of BTS clustering leads to con-
siderable performance loss as compared to the CCIR system. Specifically, it increases
the top-1 and top-5 error rates by 0.65% and 0.22% for the AlexNet, respectively. For
VGG16, the corresponding error rate increases are 0.42% and 0.14%, respectively. The
performance loss is expected. Without BTS clustering, all training images within the
confusion set are mixed together and, as a result, the random forest classifier is not able
to capture the discriminative features unique to each subset.
Mixed subset classifier. The second sub-module is the mixed subset classifier. To investigate its importance, we conducted experiments by replacing the random forest classifier with a simple probability-based classifier. That is, let the probability of a sample within subset $l$ taking label $k$ be
$$P(y = k \mid x \in l) = \frac{|X_l^k|}{\sum_{k'} |X_l^{k'}|},$$
where $X_l^k$ is the set of training images belonging to the $k$th class in subset $l$. When a test image arrives at subset $l$, this simple classifier picks $j = \arg\max_k P(y = k \mid x \in l)$ as its classification label. We compare the performance of the simple probability-based classifier with that of the random forest classifier in Table 5.7. We see an obvious performance loss on both networks; namely, 1.08% and 0.49% for the AlexNet and 0.99% and 0.36% for the VGG16 in terms of top-1 and top-5 error rates, respectively. Clearly, an advanced classifier with better feature selection capability can better capture the discriminative features unique to a particular subset.

Method                Top-1, Top-5 error (%)
AlexNet+CCIR/woRF     41.08, 19.01
AlexNet+CCIR/RF       40.00, 18.52
VGG16+CCIR/woRF       26.25, 8.83
VGG16+CCIR/RF         25.26, 8.47
Table 5.7: Performance comparison of two classifiers for the mixed subsets under the dense protocol, where woRF denotes the probability-based classifier and RF denotes the random forest classifier.
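For reference, the probability-based baseline described above can be sketched as follows; the function name and the returned (label, probability table) pair are illustrative choices, not the dissertation's implementation.

import numpy as np

def probability_baseline(y_subset):
    """Simple probability-based classifier for one mixed subset: P(y = k | x in l)
    is the fraction of the subset's training images with label k, and every test
    image routed to the subset receives the argmax label.
    y_subset: array of training labels of the images in subset l."""
    labels, counts = np.unique(y_subset, return_counts=True)
    probs = counts / counts.sum()
    return labels[np.argmax(probs)], dict(zip(labels.tolist(), probs.tolist()))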
5.4 Summary
A confusing categories identification and resolution (CCIR) system was proposed to
improve the object classification performance of a baseline CNN in this chapter. An
anchor-vector-based clustering method was introduced to obtain the confusion sets. The
binary-tree-structured clustering was used to split images within each confusion set into
subsets of similar images. Finally, for each mixed subset, a random forest classifier was
trained to capture the subtle differences between confusing categories. It was shown by
experimental results that the proposed CCIR system can boost the performance of object
classification on top of various CNNs significantly.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
Aiming at reducing the semantic gap in Content-based Image Retrieval, Multimodal Image Retrieval (MIR) has become a popular approach by embedding tags as semantic information when training a latent subspace. However, neglecting the relative importance of tags in an image results in misalignment between visual and tag features, thus leading to poor retrieval results. To address this problem, we first presented a unified framework that can automatically predict tag importance from various cues and embed it in a multimodal image retrieval system.
In the first part of this work, we focused on embedding the object tag importance into
the MIR system. To achieve this, we first proposed a discounted probability measure-
ment to quantify the object tag importance from human sentence descriptions. Then,
to predict object tag importance, we identified three types of cues: visual, semantic, and context cues. The interdependent nature of object tag importance within one image leads to a structured importance prediction model, whose parameters are learned by a Structured Support Vector Machine. Both objective and subjective results
were presented to demonstrate that our proposed MIR system can not only retrieve
images that preserve the important objects within the query, but also rank them in
decreasing order of importance.
In the second part of this work, we extended our MIR system to consider another
important component of natural images: scene. To measure the scene and object tag
importance at the same time, we developed a sentence constituency parsing tree based
method to estimate the relative importance of scene in the image, which was combined
with discounted probability based object tag importance to form the ground truth tag
importance. A subjective test was conducted to validate the consistency between our
measure and human subjective feeling. We extended our original structured model to
further consider the scene tag importance. Various features for predicting scene tag
importance were explored. The superior performance of our whole MIR/PCTI system
over other MIR systems was demonstrated by extensive experimental results.
Besides image retrieval, object classification is another long-standing problem in the Computer Vision research field. Recent advancements in deep Convolutional Neural Networks have achieved significant success due to large-scale training data and better optimization procedures, yet their performance is still not as robust as that of humans. This is mainly due to two difficulties in object classification with a large number of categories:
1) “inter-class similarity” and 2) “intra-class variety”.
To address these problems, the third part of this work focused on improving current CNN-based object classification systems from a confusion analysis point of view.
Specifically, we presented a Confusing Categories Identification and Resolution (CCIR)
system, which consists of two submodules. In the first step, the angle between anchor
vectors of two categories was used to measure their degree of confusion, based on which
an affinity graph was built. The confusing categories were subsequently clustered into
confusion sets. As the second step, we adopted a binary-tree-structured (BTS) clustering
method to split a confusion set into multiple visually similar subsets. A classifier was
subsequently learned within each subset to capture its unique discriminative features and
enhance its classification performance. Extensive experiments were conducted on the
ImageNet ILSVRC2012 dataset to demonstrate the superior performance of the proposed
CCIR scheme over the AlexNet and the VGG16.
6.2 Further Research Directions
6.2.1 Considering More Semantics
Besides object and scene tag importance, there are many other important types of image content that are desirable to preserve in the MIR framework.
Object Relation
The interaction between objects, also called object relation [36, 85, 113], is an interesting type of image content that deserves attention. Consider the example shown in Fig. 6.1, where the left image is the query image with object tags "person" and "horse" and scene tag "lawn", and the right two are candidate database images with the same object and scene tags. Clearly, by only looking at the object and scene tag importance, we would draw the conclusion that both candidate images are good retrieval results. However, for some users, this may not be the case. It is obvious that the interaction between "person" and "horse", which is the action "riding", is a useful and important component of the query image. While the first candidate image fails to preserve this information, the second candidate image does capture it. Thus, to enhance the user experience, it is necessary to take object relations into consideration. For the case of Fig. 6.1, semantic information such as "person riding horse" or "person next to horse" should be embedded into the MIR framework besides object and scene tag importance.
Attribute
Another interesting semantic that deserves attention is the attribute [26, 77, 99, 109]. For example, at query time, a user may want to explicitly form a phrase like "a smiling woman", "a luxury living room", or "a morning beach". In this case, not only the object and scene are necessary, but the attribute also becomes a very important semantic that
[Figure 6.1: An example where the object relation plays an important role in achieving good retrieval performance. While the query and candidate 2 images share the same object relation representing the action "person riding horse", the relation between "person" and "horse" in the candidate 1 image is simply the geometrical "next to". Panels: (a) Query, (b) Candidate 1, (c) Candidate 2.]
the retrieved results need to preserve. While some attributes ("red" in "red clothes") are directly related to visual properties, others ("luxury" in "a luxury living room") are more abstract and related to human subjective feeling. Embedding the relative importance of attributes of different objects and scenes can be even more challenging.
6.2.2 Other Forms of Retrieval
Another interesting direction is to explore different kinds of retrieval. Intuitively, our
proposed MIR/TIP framework can be easily extended to other types of retrieval such
as Tag based Image Retrieval and Automatic Image Annotation. Recently, other types
of queries, such as complex descriptions [10, 59, 78], sentences [13, 27, 49, 55, 116], and scene graphs [48], have been proposed. These types of retrieval aim at retrieving images that match the query exactly. It would be interesting to explore this direction by considering
how tag importance helps boost retrieval performance.
6.2.3 Efficient Large Scale Image Retrieval
In a real-world scenario, there are millions or even billions of images available on the internet. Thus, how to retrieve images efficiently, i.e., find the relevant images and return them to the user in a reasonable time, is very important. This is usually achieved by building efficient index structures [17, 54, 102, 112] and performing hashing [14, 103] at query time. Bringing tag importance into the retrieval framework will inevitably change the criteria for efficient index structures, and new hashing algorithms may be developed specifically for tag-importance-based retrieval.
6.2.4 Extending the CCIR to More Advanced Networks
In recent years, more advanced and deeper networks with better performance on the
ImageNet ILSVRC2012 dataset have been published. Currently, the network with the
best performance on the ImageNet ILSVRC2012 dataset is the ResNet [39]. With a network structure of 152 layers, it achieves 21.4% and 5.7% top-1 and top-5 errors under the 10-crop evaluation protocol, which outperforms the VGG16 by a large margin. Since the CCIR can be built on top of any CNN to address its weakness in classifying confusing categories, it would be interesting to see how much performance improvement the CCIR can
obtain on top of the ResNet.
6.2.5 Improving Top-5 Performance
As discussed in Sec. 5.3, the top-5 performance of CCIR is restricted by the smaller size
of confusion sets. This is understandable as smaller confusion sets sacrifice the Top-5
performance for better Top-1 performance. The major problem is that if a test image is
wrongly classified by the baseline CNN, it may be routed to a wrong confusion set and this error is unlikely to be recovered. However, it is plausible to address this problem using a probabilistic approach like [107]. Specifically, during testing, we can assign a test image to multiple confusion sets corresponding to the top-5 predicted categories of the CNN. Afterwards, each confusion set goes through the resolution step to obtain its predicted labels with probability scores. Finally, the output probabilities of the baseline CNN are used to compute a weighted average of the probability predictions of all confusion sets to obtain the final CCIR prediction.
6.2.6 Multi-level Confusion Hierarchy
The current CCIR system is only applicable to a two-level confusion hierarchy. However, image data and categories are usually organized in terms of a multi-level tree hierarchy [3, 19, 72, 117]. The classification problem becomes more interesting if the performance evaluation is not restricted to the leaf nodes of the hierarchy but also includes the internal nodes. Particularly, classifying two species of "husky" is very difficult for humans, and
it is generally acceptable to classify them just as “husky”. This kind of hierarchical
classification problem defines a new loss function for the whole system. Intuitively,
classifying a “Siberian husky” as an “Alaska husky” should not be penalized the same
as classifying it as a “Persian cat”. Consequently, new schemes must be developed in
order to maximize the performance of CCIR on this type of problem.
Bibliography
[1] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation anal-
ysis. In Proceedings of the 30th International Conference on Machine Learning,
pages 1247–1255, 2013.
[2] G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N.
Vishwanathan. Predicting Structured Data (Neural Information Processing). The
MIT Press, 2007.
[3] H. Bannour and C. Hudelot. Hierarchical image annotation using semantic hier-
archies. In Proceedings of the 21st ACM international conference on Information
and knowledge management, pages 2431–2434. ACM, 2012.
[4] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jor-
dan. Matching words and pictures. The Journal of Machine Learning Research,
3:1107–1135, 2003.
[5] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[6] Y . Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al. Greedy layer-wise
training of deep networks. Advances in neural information processing systems,
19:153, 2007.
[7] A. C. Berg, T. L. Berg, H. Daume, J. Dodge, A. Goyal, X. Han, A. Mensch,
M. Mitchell, A. Sood, K. Stratos, et al. Understanding and predicting importance
in images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pages 3562–3569. IEEE, 2012.
[8] T. Berg and A. C. Berg. Finding iconic images. In Computer Vision and Pattern
Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society
Conference on, pages 1–8. IEEE, 2009.
[9] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python.
O’Reilly Media, 2009.
[10] X. Cao, X. Wei, X. Guo, Y . Han, and J. Tang. Augmented image retrieval using
multi-order object layout with attributes. In Proceedings of the ACM Interna-
tional Conference on Multimedia, pages 1093–1096. ACM, 2014.
[11] X. Chen, H. Fang, T.-Y . Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick.
Microsoft coco captions: Data collection and evaluation server. arXiv preprint
arXiv:1504.00325, 2015.
[12] J. Costa Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy,
and N. Vasconcelos. On the role of correlation and abstraction in cross-modal
multimedia retrieval. Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, 36(3):521–535, 2014.
[13] B. Coyne, R. Sproat, and J. Hirschberg. Spatial relations in text-to-scene conver-
sion. In Computational Models of Spatial Language Interpretation, Workshop at
Spatial Cognition. Citeseer, 2010.
[14] M. Datar, N. Immorlica, P. Indyk, and V . S. Mirrokni. Locality-sensitive hashing
scheme based on p-stable distributions. In Proceedings of the twentieth annual
symposium on Computational geometry, pages 253–262. ACM, 2004.
[15] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and
trends of the new age. ACM Computing Surveys (CSUR), 40(2):5, 2008.
[16] R. Datta, J. Li, and J. Z. Wang. Content-based image retrieval: approaches and
trends of the new age. In Proceedings of the 7th ACM SIGMM international
workshop on Multimedia information retrieval, pages 253–262. ACM, 2005.
[17] J. Deng, A. C. Berg, and L. Fei-Fei. Hierarchical semantic indexing for large
scale image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on, pages 785–792. IEEE, 2011.
[18] J. Deng, N. Ding, Y . Jia, A. Frome, K. Murphy, S. Bengio, Y . Li, H. Neven,
and H. Adam. Large-scale object classification using label relation graphs. In
European Conference on Computer Vision, pages 48–64. Springer, 2014.
[19] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-
Scale Hierarchical Image Database. In CVPR09, 2009.
[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-
scale hierarchical image database. In Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[21] J. Deng, S. Satheesh, A. C. Berg, and F. Li. Fast and balanced: Efficient label tree
learning for large scale object recognition. In Advances in Neural Information
Processing Systems, pages 567–575, 2011.
[22] P. Duygulu, K. Barnard, J. F. de Freitas, and D. A. Forsyth. Object recognition as
machine translation: Learning a lexicon for a fixed image vocabulary. In Com-
puter Vision–ECCV 2002, pages 97–112. Springer, 2002.
[23] D. Eigen, J. Rolfe, R. Fergus, and Y . LeCun. Understanding deep architectures
using a recursive convolutional network. arXiv preprint arXiv:1312.1847, 2013.
[24] L. Elazary and L. Itti. Interesting objects are visually salient. Journal of vision,
8(3):3, 2008.
[25] C. Farabet, C. Couprie, L. Najman, and Y . LeCun. Learning hierarchical features
for scene labeling. IEEE transactions on pattern analysis and machine intelli-
gence, 35(8):1915–1929, 2013.
[26] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their
attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE
Conference on, pages 1778–1785. IEEE, 2009.
[27] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier,
and D. Forsyth. Every picture tells a story: Generating sentences from images.
In Computer Vision–ECCV 2010, pages 15–29. Springer, 2010.
[28] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object
detection with discriminatively trained part-based models. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
[29] R. Fergus, H. Bernal, Y . Weiss, and A. Torralba. Semantic label sharing for
learning with many categories. In European Conference on Computer Vision,
pages 762–775. Springer, 2010.
[30] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1440–1448, 2015.
[31] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 580–587, 2014.
[32] Y . Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space
for modeling internet images, tags, and their semantics. International journal of
computer vision, 106(2):210–233, 2014.
[33] Y . Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving
image-sentence embeddings using large weakly annotated photo collections. In
Computer Vision–ECCV 2014, pages 529–545. Springer, 2014.
[34] I. Goodfellow, Y . Bengio, and A. Courville. Deep learning. Book in preparation
for MIT Press, 2016.
[35] G. Griffin and P. Perona. Learning and using taxonomies for fast visual catego-
rization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE
Conference on, pages 1–8. IEEE, 2008.
[36] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and compar-
ative adjectives for learning visual classifiers. In Computer Vision–ECCV 2008,
pages 16–29. Springer, 2008.
[37] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analy-
sis: An overview with application to learning methods. Neural computation,
16(12):2639–2664, 2004.
[38] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convo-
lutional networks for visual recognition. In European Conference on Computer
Vision, pages 346–361. Springer, 2014.
[39] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recogni-
tion. arXiv preprint arXiv:1512.03385, 2015.
[40] G. E. Hinton, S. Osindero, and Y .-W. Teh. A fast learning algorithm for deep
belief nets. Neural computation, 18(7):1527–1554, 2006.
[41] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In
Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on,
pages 1–8. IEEE, 2007.
[42] S. J. Hwang and K. Grauman. Accounting for the relative importance of objects
in image retrieval. In BMVC, pages 1–12, 2010.
[43] S. J. Hwang and K. Grauman. Learning the relative importance of objects from
tagged images for retrieval and cross-modal search. International journal of com-
puter vision, 100(2):134–153, 2012.
[44] S. J. Hwang, K. Grauman, and F. Sha. Analogy-preserving semantic embedding
for visual object categorization. In ICML (3), pages 639–647, 2013.
[45] J. Jeon, V . Lavrenko, and R. Manmatha. Automatic image annotation and retrieval
using cross-media relevance models. In Proceedings of the 26th annual inter-
national ACM SIGIR conference on Research and development in information
retrieval, pages 119–126. ACM, 2003.
[46] Y . Jia, J. T. Abbott, J. Austerweil, T. Griffiths, and T. Darrell. Visual concept
learning: Combining machine vision and bayesian generalization on concept
hierarchies. In Advances in Neural Information Processing Systems, pages 1842–
1850, 2013.
[47] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural svms.
Machine Learning, 77(1):27–59, 2009.
[48] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-
Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 3668–3678, 2015.
[49] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating
image descriptions. arXiv preprint arXiv:1412.2306, 2014.
[50] K. Kavukcuoglu, P. Sermanet, Y .-L. Boureau, K. Gregor, M. Mathieu, and Y . L.
Cun. Learning convolutional feature hierarchies for visual recognition. In
Advances in neural information processing systems, pages 1090–1098, 2010.
[51] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializa-
tions of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
[52] R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalan-
tidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome:
Connecting language and vision using crowdsourced dense image annotations.
2016.
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.
[54] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable
image search. In Computer Vision, 2009 IEEE 12th International Conference
on, pages 2130–2137. IEEE, 2009.
[55] G. Kulkarni, V . Premraj, S. Dhar, S. Li, Y . Choi, A. C. Berg, and T. L. Berg.
Baby talk: Understanding and generating image descriptions. In Proceedings of
the 24th CVPR. Citeseer, 2011.
[56] C.-C. J. Kuo. Understanding convolutional neural networks with a mathematical
model. arXiv preprint arXiv:1609.04112, 2016.
[57] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Prob-
abilistic models for segmenting and labeling sequence data. Proceedings of the
18th International Conference on Machine Learning (ICML-2001), 2001.
[58] T. Lan and G. Mori. A max-margin riffled independence model for image tag
ranking. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Con-
ference on, pages 3103–3110. IEEE, 2013.
[59] T. Lan, W. Yang, Y . Wang, and G. Mori. Image retrieval with structured object
queries using latent ranking svm. In Computer Vision–ECCV 2012, pages 129–
142. Springer, 2012.
[60] Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[61] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classi-
fication, annotation and segmentation in an automatic framework. In Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages
2036–2043. IEEE, 2009.
[62] L.-J. Li, H. Su, L. Fei-Fei, and E. P. Xing. Object bank: A high-level image repre-
sentation for scene classification & semantic feature sparsification. In Advances
in neural information processing systems, pages 1378–1386, 2010.
[63] L.-J. Li, C. Wang, Y . Lim, D. M. Blei, and L. Fei-Fei. Building and using a seman-
tivisual image hierarchy. In Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pages 3336–3343. IEEE, 2010.
[64] S. Z. Li. Markov random field modeling in computer vision. Springer Science &
Business Media, 2012.
[65] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár,
and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer
Vision–ECCV 2014, pages 740–755. Springer, 2014.
[66] B. Liu, F. Sadeghi, M. Tappen, O. Shamir, and C. Liu. Probabilistic label trees for
efficient large scale image classification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 843–850, 2013.
[67] D. Liu, X.-S. Hua, L. Yang, M. Wang, and H.-J. Zhang. Tag ranking. In Proceed-
ings of the 18th international conference on World wide web, pages 351–360.
ACM, 2009.
[68] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky.
The Stanford CoreNLP natural language processing toolkit. In Proceedings of
52nd Annual Meeting of the Association for Computational Linguistics: System
Demonstrations, pages 55–60, 2014.
[69] D. K. C. D. Manning. Natural language parsing. In Advances in Neural Infor-
mation Processing Systems 15: Proceedings of the 2002 Conference, volume 15,
page 3. MIT Press, 2003.
[70] M. Marszalek and C. Schmid. Semantic hierarchies for visual object recognition.
In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages
1–7. IEEE, 2007.
[71] M. Marszałek and C. Schmid. Constructing category hierarchies for visual recog-
nition. In European Conference on Computer Vision, pages 479–491. Springer,
2008.
[72] G. A. Miller. Wordnet: a lexical database for english. Communications of the
ACM, 38(11):39–41, 1995.
[73] K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[74] A. Y . Ng, M. I. Jordan, Y . Weiss, et al. On spectral clustering: Analysis and an
algorithm. Advances in neural information processing systems, 2:849–856, 2002.
[75] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y . Ng. Multimodal deep
learning. In Proceedings of the 28th international conference on machine learn-
ing (ICML-11), pages 689–696, 2011.
[76] S. Nowozin and C. H. Lampert. Structured learning and prediction in computer
vision. Foundations and Trends
in Computer Graphics and Vision, 6(3–4):185–
365, 2011.
[77] D. Parikh and K. Grauman. Relative attributes. In Computer Vision (ICCV), 2011
IEEE International Conference on, pages 503–510. IEEE, 2011.
[78] N. Pourian and B. Manjunath. Retrieval of images with objects of specific size,
location, and spatial configuration. In Applications of Computer Vision (WACV),
2015 IEEE Winter Conference on, pages 960–967. IEEE, 2015.
[79] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image
annotations using amazon’s mechanical turk. In Proceedings of the NAACL HLT
2010 Workshop on Creating Speech and Language Data with Amazon’s Mechan-
ical Turk, pages 139–147. Association for Computational Linguistics, 2010.
[80] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy,
and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In Pro-
ceedings of the international conference on Multimedia, pages 251–260. ACM,
2010.
[81] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object
detection with region proposal networks. In Advances in neural information pro-
cessing systems, pages 91–99, 2015.
[82] Y . Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques,
promising directions, and open issues. Journal of visual communication and
image representation, 10(1):39–62, 1999.
[83] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet
Large Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015.
[84] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: a database
and web-based tool for image annotation. International journal of computer
vision, 77(1-3):157–173, 2008.
[85] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In Computer
Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1745–
1752. IEEE, 2011.
[86] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual
appearance for multiclass object detection. In Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on, pages 1481–1488. IEEE, 2011.
[87] A. Schrijver. Theory of linear and integer programming. John Wiley & Sons,
1998.
[88] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y . LeCun. Overfeat:
Integrated recognition, localization and detection using convolutional networks.
arXiv preprint arXiv:1312.6229, 2013.
[89] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[90] J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsuper-
vised discovery of visual object class hierarchies. In Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[91] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based
image retrieval at the end of the early years. Pattern Analysis and Machine Intel-
ligence, IEEE Transactions on, 22(12):1349–1380, 2000.
[92] M. Spain and P. Perona. Measuring and predicting object importance. Interna-
tional Journal of Computer Vision, 91(1):59–76, 2011.
[93] N. Srivastava and R. R. Salakhutdinov. Discriminative transfer learning with
tree-based priors. In Advances in Neural Information Processing Systems, pages
2094–2102, 2013.
[94] C. Sutton and A. McCallum. An introduction to conditional random fields for
relational learning. Introduction to statistical relational learning, pages 93–128,
2006.
[95] C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Van-
houcke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9,
2015.
[96] B. Taskar, V . Chatalbashev, D. Koller, and C. Guestrin. Learning structured pre-
diction models: A large margin approach. In Proceedings of the 22nd interna-
tional conference on Machine learning, pages 896–903. ACM, 2005.
[97] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y . Altun. Support vector
machine learning for interdependent and structured output spaces. In Proceed-
ings of the twenty-first international conference on Machine learning, page 104.
ACM, 2004.
[98] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y . Altun. Large margin meth-
ods for structured and interdependent output variables. In Journal of Machine
Learning Research, pages 1453–1484, 2005.
[99] N. Turakhia and D. Parikh. Attribute dominance: What pops out? In Computer
Vision (ICCV), 2013 IEEE International Conference on, pages 1225–1232. IEEE,
2013.
[100] N. Verma, D. Mahajan, S. Sellamanickam, and V . Nair. Learning hierarchical
similarity metrics. In Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, pages 2280–2287. IEEE, 2012.
[101] A. Voutilainen. Part-of-speech tagging. The Oxford handbook of computational
linguistics, pages 219–232, 2003.
[102] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image
retrieval. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Con-
ference on, pages 3424–3431. IEEE, 2010.
[103] Y . Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in neural
information processing systems, pages 1753–1760, 2009.
[104] G. K. Wonjoon Goo, Juyong Kim and S. J. Hwang. Taxonomy-regularized
semantic deep convolutional neural networks. In ECCV, 2016.
[105] J. Xiao, J. Hays, K. Ehinger, A. Oliva, A. Torralba, et al. Sun database: Large-
scale scene recognition from abbey to zoo. In Computer vision and pattern recog-
nition (CVPR), 2010 IEEE conference on, pages 3485–3492. IEEE, 2010.
[106] T. Xiao, J. Zhang, K. Yang, Y . Peng, and Z. Zhang. Error-driven incremental
learning in deep convolutional neural network for large-scale image classification.
In Proceedings of the 22nd ACM international conference on Multimedia, pages
177–186. ACM, 2014.
[107] Z. Yan, H. Zhang, R. Piramuthu, V . Jagadeesh, D. DeCoste, W. Di, and Y . Yu.
Hd-cnn: Hierarchical deep convolutional neural network for large scale visual
recognition. In ICCV’15: Proc. IEEE 15th International Conf. on Computer
Vision, 2015.
[108] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object
detection, scene classification and semantic segmentation. In Computer Vision
and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 702–709.
IEEE, 2012.
[109] F. X. Yu, R. Ji, M.-H. Tsai, G. Ye, and S.-F. Chang. Weak attributes for large-
scale image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, pages 2949–2956. IEEE, 2012.
[110] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks
for mid and high level feature learning. In 2011 International Conference on
Computer Vision, pages 2018–2025. IEEE, 2011.
[111] H. Zhang, Y . Yang, H. Luan, S. Yang, and T.-S. Chua. Start from scratch: Towards
automatically identifying, modeling, and naming visual attributes. In Proceed-
ings of the ACM International Conference on Multimedia, pages 187–196. ACM,
2014.
[112] S. Zhang, M. Yang, X. Wang, Y . Lin, and Q. Tian. Semantic-aware co-indexing
for image retrieval. In Computer Vision (ICCV), 2013 IEEE International Con-
ference on, pages 1673–1680. IEEE, 2013.
[113] Y . Zhang, Z. Jia, and T. Chen. Image retrieval with geometry-preserving visual
phrases. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Con-
ference on, pages 809–816. IEEE, 2011.
[114] B. Zhao, F. Li, and E. P. Xing. Large-scale category structure aware image cate-
gorization. In Advances in Neural Information Processing Systems, pages 1251–
1259, 2011.
[115] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features
for scene recognition using places database. In Advances in neural information
processing systems, pages 487–495, 2014.
[116] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation
of sentences. In Computer Vision (ICCV), 2013 IEEE International Conference
on, pages 1681–1688. IEEE, 2013.
[117] A. Zweig and D. Weinshall. Exploiting object hierarchy: Combining models
from different category levels. In 2007 IEEE 11th International Conference on
Computer Vision, pages 1–8. IEEE, 2007.
Abstract
Computer vision has achieved a major breakthrough in recent years with the advancement of deep learning based methods. However, its performance cannot yet be claimed as robust for practical applications, and more advanced methods on top of deep learning architectures are needed. This work targets using deep learning features to tackle two major computer vision problems: Multimodal Image Retrieval and Object Classification.
Multimodal Image Retrieval (MIR) aims at building the alignment between the visual and textual modalities, thus reducing the well-known "semantic gap" in the image retrieval problem. As the most widely existing textual information of images, tags play an important semantic role in the MIR framework. However, treating all tags in an image as equally important may result in misalignment between the visual and textual domains, leading to poor retrieval performance. To address this problem and build a robust retrieval system, we propose an MIR framework that embeds tag importance as the textual feature. In the first part, we propose an MIR system, called Multimodal Image Retrieval with Tag Importance Prediction (MIR/TIP), to embed the automatically predicted object tag importance in image retrieval. To achieve this goal, a discounted probability metric is first presented to measure the object tag importance from human sentence descriptions. Using this as ground truth, a structured object tag importance prediction model is proposed. The proposed model integrates visual, semantic, and context cues to achieve robust object tag importance prediction performance. Our experimental results demonstrate that, by embedding the predicted object tag importance, a significant performance gain can be obtained in terms of both objective and subjective evaluation. In the second part, the MIR/TIP system is extended to account for "scene", which is another important aspect of an image. To jointly measure the scene and object tag importance, the discounted probability metric is modified to consider the grammatical role of the scene tag in the human annotated sentence. The structured model is modified to predict the scene and object tag importance at the same time. Our experimental results demonstrate that the robustness of the MIR system is greatly enhanced by our predicted scene and object tag importance.
Object classification is a long-standing problem in the computer vision field, which serves as the foundation for other problems such as object detection, scene classification, and image annotation. As the number of object categories continues to increase, it is inevitable to have certain categories that are more confusing than others due to the proximity of their samples in the feature space. In the third part, we conduct a detailed analysis of confusing categories and propose a confusing categories identification and resolution (CCIR) scheme, which can be applied to any CNN-based object classification baseline method to further improve its performance. In the CCIR scheme, we first present a procedure to cluster confusing object categories together to form confusion sets automatically. Then, a binary-tree-structured (BTS) clustering method is adopted to split a confusion set into multiple subsets. A classifier is subsequently learned within each subset to enhance its performance. Experimental results on the ImageNet ILSVRC2012 dataset show that the proposed CCIR scheme can offer a significant performance gain over the AlexNet and the VGG16.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Machine learning methods for 2D/3D shape retrieval and classification
Advanced techniques for object classification: methodologies and performance evaluation
Data-efficient image and vision-and-language synthesis and classification
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Multimodal representation learning of affective behavior
Labeling cost reduction techniques for deep learning: methodologies and applications
Efficient machine learning techniques for low- and high-dimensional data sources
Detecting semantic manipulations in natural and biomedical images
Classification and retrieval of environmental sounds
A learning‐based approach to image quality assessment
Unsupervised domain adaptation with private data
Efficient graph learning: theory and performance evaluation
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation
Explainable and green solutions to point cloud classification and segmentation
Object classification based on neural-network-inspired image transforms
Facial age grouping and estimation via ensemble learning
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
Enabling spatial-visual search for geospatial image databases
Learning shared subspaces across multiple views and modalities
Transfer learning for intelligent systems in the wild
Asset Metadata
Creator: Li, Shangwen (author)
Core Title: Multimodal image retrieval and object classification using deep learning features
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Degree Conferral Date: 2017-05
Publication Date: 02/03/2017
Defense Date: 12/01/2016
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag: confusing categories analysis, cross-domain learning, deep learning, image retrieval, importance measure, importance prediction, MIR, multimodal image retrieval, OAI-PMH Harvest, object classification, semantic gap, tag importance
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Kuo, C.-C. Jay (committee chair), Georgiou, Panayiotis (committee member), Nakano, Aiichiro (committee member)
Creator Email: shangwel@usc.edu, shangwenli.usc@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC11255611
Unique identifier: UC11255611
Identifier: etd-LiShangwen-4991.pdf (filename)
Legacy Identifier: etd-LiShangwen-4991
Dmrecord: 328783
Document Type: Dissertation
Rights: Li, Shangwen
Internet Media Type: application/pdf
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu