LOCAL-AWARE DEEP LEARNING: METHODOLOGY AND APPLICATIONS
by
Heming Zhang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2020
Copyright 2020 Heming Zhang
Contents
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Significance of the Research
1.1.1 Visual Dialogue
1.1.2 Face Detection
1.1.3 Fashion Representation Extraction
1.1.4 Fashion Outfit Compatibility Learning
1.2 Contributions of the Research
1.2.1 Generative Visual Dialogue System with Attention
1.2.2 Leveraging Local Cues for Face Detection on Mobile Devices
1.2.3 Fashion Analysis using Local Region Characteristics
1.2.4 Fashion Outfit Compatibility Learning with Global and Local Supervisions
1.3 Organization of the Dissertation
2 Generative Visual Dialogue System with Attention
2.1 Introduction
2.2 Related Work
2.3 Proposed Generative VD System
2.3.1 Adaptive Multi-modal Reasoning (AMR)
2.3.2 WLE-based Training Scheme
2.4 Experiments
2.4.1 Dataset
2.4.2 Implementation Details
2.4.3 Experiments Results and Analysis
2.5 Conclusion
3 Leveraging Local Cues for Face Detection on Mobile Devices
3.1 Introduction
3.2 Review of Related Work
3.2.1 Traditional Face Detectors
3.2.2 CNN-based Face Detectors
3.3 Proposed Method using Local Facial Characteristics
3.3.1 Proposal Network
3.3.2 Proposal Generation
3.3.3 Model Acceleration
3.4 Experiments
3.4.1 Experimental Setup
3.4.2 Evaluation of Model Size
3.4.3 Evaluation of Face Detection
3.4.4 Evaluation of Runtime Efficiency
3.4.5 Evaluation of Training Acceleration
3.5 Conclusion
4 Fashion Analysis Using Local Region Characteristics
4.1 Introduction
4.2 Review of Related Work
4.3 Proposed Method
4.3.1 Representative Region Detection
4.3.2 Local Feature Extraction
4.3.3 Data Collection and Web Attributes Dataset
4.4 Experiments
4.5 Conclusion
5 Fashion Outfit Compatibility Learning with Global and Local Supervisions
5.1 Introduction
5.2 Related Work
5.2.1 Color Compatibility
5.2.2 Fashion Outfit Compatibility
5.2.3 Graph Convolutional Networks
5.3 Learning Color Compatibility in Outfits
5.3.1 Color Feature Extraction
5.3.2 Graph Construction and Embedding
5.3.3 Compatibility Prediction and Outfit Clustering
5.3.4 Global and local supervisions
5.4 Experimental Analysis
5.4.1 Datasets
5.4.2 Implementation Details
5.4.3 Comparison with Previous Work
5.4.4 Ablation Studies
5.4.5 Outfit Cluster Visualization
5.4.6 Fashion Recommendation Based on Color Compatibility
5.5 Conclusion
6 Conclusion and Future Work
6.1 Summary of the Research
6.2 Future Research
6.2.1 Context-aware Attention Mechanism
6.2.2 Attention Mechanism on Graphs
Bibliography
List of Tables
2.1 Performance of generative models on VisDial 0.9. 'Mean' denotes mean rank, for which lower is better. All the models use VGG as backbone except for Coref which uses ResNet.
2.2 Performance of generative models on VisDial v1.0 val. Results of previous work are reported by ReDAN.
2.3 Ablation study on VisDial 0.9. Top: absolute values. Bottom: improvement from MLE models.
3.1 Parameters of the proposal network.
3.2 Proposal network without acceleration
3.3 Proposal network with acceleration
3.4 Summary of training data used in this work.
3.5 Comparisons of model sizes of several face detectors.
3.6 Comparisons of detection performance with the MTCNN [134] on the WIDER-face validation set [125] with different scale factors.
4.1 Comparison of fashion datasets without human annotation
4.2 Comparison of accuracy of the attribute classification tasks.
4.3 The performance of the customized Fast RCNN on the Clothing Parsing dataset with the test split.
5.1 Comparison with previous methods using color palettes or deep features as image representation. For Siamese Net, we adopt the performances reported in [106]. All the methods extract deep features from ResNet-18.
5.2 Comparison with previous methods. Text information is additionally utilized during training as regularization. For Bi-LSTM and Siamese Net, we adopt the performances reported in [106]. All the methods extract deep features from ResNet-18, except for Bi-LSTM, which uses Inception-v3.
5.3 Ablation studies on how each proposed module contributes to the final performance. The experiments are conducted on the validation set of Polyvore Outfits with color palettes as fashion representations.
List of Figures
1.1 Example of visual dialogue. The human user asks questions regarding the image and the machine answers the questions. A dialogue consists of multiple rounds of questions and answers.
1.2 Example of face detection. The bounding boxes in green color around the faces are desired prediction outputs.
1.3 Example of fashion images. The background can be either simple or in the wild.
1.4 Examples of compatible fashion outfits.
2.1 (a) An example from the VisDial dataset, and (b) comparison between MLE, GAN and WLE, where positive responses are highlighted in blue. The MLE-based generator learns from data in positive answers only. The GAN-based generator learns from data in negative answers indirectly through discriminators. Our WLE-based generator learns from data in both positive and negative answers.
2.2 The adaptive multi-modal reasoning.
2.3 Examples of top-10 responses ranked by our model. When there are multiple correct responses to the question, our model may choose other candidates that are semantically similar to the human response. The human responses are highlighted in blue.
2.4 Results of the top-10 teams in the first visual dialog challenge. As the only team in the top 10 using a generative visual dialogue system, we are ranked in 6th place (highlighted in gray). Our NDCG score is comparable with other discriminative systems.
2.5 Visualization of image attention heatmaps for different questions and reasoning steps. Regions of attention are highlighted in blue.
2.6 Qualitative results on test. The questions and answers are truncated at 16 and 8 words, respectively, the same as in our data pre-processing.
3.1 Comparison between our method and a typical CNN cascade framework (TCCF). The main difference is that local characteristics such as eyes, nose, mouth, etc. are captured in our method so that a single level of the pyramid can encode multiple scales of faces. The number of pyramid levels is thus reduced to speed up the proposal generation process.
3.2 Illustration of the proposal network. In the training stage, face and facial part patches are randomly cropped from training images and used to train a multi-label classification network. In the testing stage, a test image is resized to form a sparse pyramid, and fed into the multi-label classification network to generate heatmaps of face and facial parts. Based on the heatmaps and bounding box templates of each facial part, we can generate face proposals. These face proposals will be sent to the next stage of the CNN cascade.
3.3 The pipeline of our proposal generation process, where the eye and the mouth are adopted to illustrate facial parts.
3.4 Comparison of the standard and the accelerated training schemes for the proposal network.
3.5 Evaluation results on FDDB.
3.6 Comparison of convergence curves, where the x-axis is the number of iterations.
4.1 The pipeline of the proposed fashion feature extraction system.
4.2 The architecture of the customized Fast RCNN.
4.3 The local feature extraction module.
4.4 Exemplary images of the Web Attributes fashion dataset.
4.5 Comparison of the top-5 retrieval results between the Stylenet [91] and the proposed feature extractor.
5.1 An overview of our proposed method for learning color compatibility in fashion outfits. First, color palettes are extracted from each fashion item image as the fashion feature. Then a graph is constructed such that each node represents the pairwise relation between two fashion items. Afterwards, the constructed graph together with the extracted fashion features is embedded into a single vector embedding, which is used for the final compatibility prediction. The joint training scheme for compatibility prediction and outfit clustering helps improve the prediction performance and interpretability.
5.2 (a) The graph construction step, modelling each pairwise item relation as a node. The node features are obtained by embedding two item features. (b) The graph embedding step with GCN.
5.3 Illustration of the proposed joint compatibility prediction and outfit clustering method. We alternately generate pseudo labels for every sample and update the network parameters using the pseudo labels.
5.4 Graph constructions for global and local supervision of a given pair of samples. For global supervision, all items in the outfits are used. For local supervision, common items in the positive-negative sample pair are not used. Eliminated nodes and edges are denoted in dotted lines.
5.5 Compatible outfit clusters predicted using color palettes as fashion representation. Due to limited space, we only display 50 outfits per cluster and at most 10 items per outfit. The 50 outfits are shown on the left, where items in the same column belong to the same outfit. Two outfit samples in the bounding boxes are enlarged and shown on the right.
5.6 A compatible outfit cluster predicted using deep image features that reveals a color pattern. The items in the same column belong to the same outfit. The visualization suggests that the items of the outfits in this cluster have similar color.
5.7 TSNE visualization of the outfit embeddings obtained from a model trained with one negative cluster. The black points are the samples whose compatibility is wrongly predicted and the points in other colors are the samples whose compatibility is correctly predicted. Furthermore, each color indicates a different cluster predicted by the trained model.
5.8 Left: fashion items of the given incomplete outfits. Right: fashion item recommendation based on different patterns of compatible colors.
6.1 Illustration of conventional attention methods. The input is a three-dimensional tensor with w×h features of size c×1, representing features of w×h regions. The output is a weighted-sum feature of size c×1.
6.2 Illustration of assigning weights in conventional attention methods.
6.3 Attention visualization on the object "person". A darker blue region denotes a region assigned a higher weight.
6.4 Illustration of manually masking out shared nodes for a pair of outfits in Sec. 5.
Abstract
Deep learning techniques utilize networks of multiple cascaded layers to map inputs to desired outputs. To map the entire input to the desired output, useful information should be extracted through the layers. During this mapping, feature extraction and prediction are performed jointly, and we have no direct control over the feature extraction. Consequently, some useful information, especially local information, is discarded in the process.
In this thesis, we specifically study local-aware deep learning techniques from four
different aspects:
1. Local-aware network architecture
2. Local-aware proposal generation
3. Local-aware region analysis
4. Local-aware supervision
Specifically, we design a multi-modal attention mechanism for a generative visual dialogue system in Chapter 2. A visual dialogue system holds a dialogue between a human and a machine. A generative visual dialogue system takes an image, a sentence in one round of dialogue and the dialogue in the past rounds as inputs, and generates the corresponding response to continue the dialogue. Our proposed local-aware network architecture is able to simultaneously attend to these multi-modal inputs and utilize the extracted local information to generate dialogue responses.
We propose a proposal network for a fast face detection system on mobile devices in Chapter 3. A face detection system on mobile devices faces many challenges, including high accuracy, fast inference speed and small model size, due to the limited computation power and storage space of mobile devices. Our proposed local-aware proposal generation module is able to detect salient facial parts and use them as local cues for the detection of entire faces. It accelerates inference and does not place much burden on the model size.
We extract representative fashion features by analyzing local regions in Chapter 4.
Many fashion attributes, such as the shape of the collar, the length of the sleeves, the
pattern of the prints, etc., can only be found in local regions. Our proposed local-aware
region analysis extracts representative fashion features from different levels of the deep
network, so that the extracted fashion features contain many local fashion details of interest to humans.
We develop a fashion outfit compatibility learning method with local graphs in Chap-
ter 5. When modeling a fashion outfit as a graph, a network that learns compatibility only on entire outfit graphs may ignore some subtle differences among outfits. Our proposed local-aware supervision includes the construction of local graphs and the corresponding local loss function. The local graphs are constructed from partial outfits. The network trained with the local loss function on the local graphs is then able to learn the subtle differences in compatibility in the fashion outfit data.
Chapter 1
Introduction
1.1 Significance of the Research
In recent years, deep learning, as part of machine learning methods, has gained great
interest from researchers. To replace hand-crafted features, deep learning techniques
use a cascade of multiple layers of nonlinear processing units (network) to automatically
extract deep features. In the end-to-end setting, deep learning models jointly achieve
feature representation and task completion.
Different deep learning architectures, such as convolutional neural networks, recurrent neural networks and graph convolutional networks, have been applied to fields including computer vision, natural language processing, speech recognition, etc. Deep learning techniques have even surpassed human performance on some tasks, such as image classification [54, 93, 31, 38] and face recognition [88, 83, 115]. Nevertheless,
deep learning techniques still have large performance gaps compared to humans on
many tasks.
Ordinary deep learning techniques aim to map the entire input to the desired
outputs. During the joint feature representation and prediction, global information is
extracted and utilized through the deep network layers, while some local information
may be lost. However, local information is also very important in different aspects:
Local information may be the key to solving specific problems, such as in
visual question answering [118, 128, 70] and visual dialogue answering [14, 68,
116]. In these tasks, human users will ask questions grounded in images, and the
machine should be able to answer the question using the information from the
images. Oftentimes, local information is more important than global information
for answering the questions. For example, given an indoor image, the human user
may ask if it is day or night. Then to answer this question, the machine needs to
search for local information, such as light sources, or windows.
Local information may provide cues when global information is not available,
such as in detection and localization tasks [25, 86, 36, 30, 67]. In many cases, local
cues are enough for us to detect or localize objects. For example, we can find a
cat by its face only.
Local information may also be of interest to humans, such as in fashion analysis [90, 65]. In these tasks, people are interested in local details. For example, given a fashion image that contains a model wearing multiple pieces of clothing standing on the street, people may be interested in the jacket on the model.
Local information may help differentiate similar objects, such as in fashion
outfit compatibility studies [29, 106] and fine-grained classification [117, 1, 123].
In these tasks, objects from different classes may only differ in local details. For
example, a compatible fashion outfit and an incompatible fashion outfit may only
differ in a few fashion items.
In this thesis, we focus on local-aware deep learning techniques for the reasons listed
above. Specifically, we focus on local-aware deep learning techniques for visual dia-
logue, face detection, fashion representation extraction and fashion outfit compatibility
analysis. We tackle each problem from a specific aspect of the local-aware deep learning
techniques.
1.1.1 Visual Dialogue
A visual dialogue system interacts with humans via natural language dialogue grounded
in vision (for example, images). It requires the system to reason on both vision and lan-
guage and generate consistent and natural language. An example is shown in Figure 1.1.
Figure 1.1: Example of visual dialogue. The human user asks questions regarding the
image and the machine answers the questions. A dialogue consists of multiple rounds
of questions and answers.
For visual dialogue systems, the attention mechanism plays a very important role. It attends to certain parts of the inputs and extracts useful information from the attended parts.
Unlike other problems that also need attention, e.g., visual captioning and visual question answering, the attention mechanism for visual dialogue should be able to attend to multi-modal inputs. Previous work either does not attend to all of the inputs [14, 68], or
manually designs a sequential order of attentions paid to each input [68, 116]. Conse-
quently, their proposed attentions cannot help answer complicated questions.
Furthermore, the response generators in previous work are trained using the maximum likelihood estimation (MLE) approach, which often suffers from a bias towards generating more frequent responses.
1.1.2 Face Detection
Face detection aims at predicting bounding boxes showing the locations of all faces in
an image. An example is shown in Figure 1.2. Face detection using deep learning tech-
niques has achieved impressive accuracy. Yet it is still challenging for mobile devices
which not only require high accuracy but also require fast speed and small model size.
Figure 1.2: Example of face detection. The bounding boxes in green color around the
faces are desired prediction outputs.
Since an image may contain both large and small faces, it is challenging for the detector to detect faces at various scales. Previous work either proposes a large network capable of detecting faces at all scales [125, 4, 36], or designs a network for a narrow range of face scales and re-scales the input image up and down [30, 134, 57, 85, 18]. The model size of the former is not feasible for most mobile devices. Although some work in the latter group has a small model size, the detection speed is still not high enough for real-time applications on mobile devices. The challenge therefore lies in how to accelerate current small models while keeping high accuracy.
1.1.3 Fashion Representation Extraction
Fashion images are images that demonstrate fashion items, such as clothes, bags, shoes,
etc. Oftentimes there are one or more models wearing fashion items in the images. Some
examples are shown in Figure 1.3. Fashion image analysis has many applications such
as personalized recommendation and online shopping. It requires representative feature
extraction from fashion images.
Figure 1.3: Example of fashion images. The background can be either simple or in the
wild.
Recent work relies on well-annotated data [60, 61, 65, 67, 39]. Nevertheless, manually annotating fashion images is more complicated than annotating other data types due to the ambiguous nature of fashion. The boundary between fashion concepts can be subjective. For example, the same item labeled as a "jacket" by one annotator may be labeled as a "coat" by another. Instead of spending lots of resources on manual annotation of fashion images, some studies focus on utilizing online resources for fashion studies without human annotations [121, 91, 90]. These online resources often provide rich but inaccurate (or noisy) metadata [91]. How to take advantage of such online resources remains an open problem.
1.1.4 Fashion Outfit Compatibility Learning
Fashion outfits are sets of fashion items. The fashion outfit compatibility analysis aims
to analyze whether a given fashion outfit is compatible and why it is compatible. Four
compatible fashion outfits are demonstrated in Figure 1.4.
Figure 1.4: Examples of compatible fashion outfits.
Most previous work on fashion outfit compatibility mainly focuses on the compatibility between two fashion items and then aggregates the outfit compatibility by averaging the compatibility between each pair of fashion items. A few recent works model fashion outfits as sequences or graphs, which better analyze the compatibility of entire outfits. However, the order of their proposed sequences is artificial and meaningless, which may hurt the analysis. Although it is a reasonable choice to model a fashion outfit as a graph, recent graph constructions for fashion outfits cannot fully utilize the power of recent deep learning techniques.
1.2 Contributions of the Research
1.2.1 Generative Visual Dialogue System with Attention
Our proposed system is based on multi-modal attention mechanism and weighted like-
lihood estimation:
Given multi-modal inputs of an image, a question and history of dialogue which
contains past questions and answers, we designed a recurrently guided multi-
modal attention mechanism. It incorporates a recurrent neural network unit,
and is able to simultaneously attend to multi-modal inputs and recurrently refine
the attention guided by the inputs.
Given both positive and negative training samples, we developed a novel training
scheme for generative visual dialogue systems using weighted likelihood estima-
tion. It utilizes both positive and negative training samples, which improves the
quality of generated responses.
1.2.2 Leveraging Local Cues for Face Detection on Mobile Devices
Our proposed light-weight proposal network accelerates face detection by inference on local cues:
Given an input image, our proposal network needs fewer up/down scalings of the input image to find potential face regions within the same size range.
Our proposal network leverages both global and local cues to propose regions that may contain faces. Leveraging multiple cues increases the range of face sizes that can be detected from one scaled image.
1.2.3 Fashion Analysis using Local Region Characteristics
We propose a method to extract representative features for fashion analysis by utilizing
weakly annotated online fashion images in this work:
We proposed a two-stage system:
– In the first stage, we attempt to detect clothing items in a fashion image: the
top clothes (t), bottom clothes (b) and one-pieces (o).
– In the second stage, we extract discriminative features from detected regions
for various applications of interest.
We proposed a way to collect fashion images from online resources and conduct
automatic annotation on them. Based on this methodology, we create a new fash-
ion dataset, called the Web Attributes, to train our feature extractor.
1.2.4 Fashion Outfit Compatibility Learning with Global and Local
Supervisions
Our proposed system predicts and interprets fashion outfit compatibility:
Given a fashion outfit containing multiple fashion items, we constructed a graph
to model the entire outfit. The proposed graph construction models each pair-wise
relation between two fashion items as a node. This novel graph construction can
better utilize the advanced graph deep learning techniques.
Given fashion outfit data with only binary compatibility labels, we proposed a
joint compatibility prediction and outfit clustering method, which helps us inter-
pret the outfit compatibility.
Given compatible fashion outfit data, we designed a training scheme from data
sampling to global and local graph construction and corresponding loss functions
that improves the model’s discriminative power on subtle differences.
1.3 Organization of the Dissertation
The rest of the dissertation is organized as follows: in Chapter 2, we propose a generative visual dialogue system with a multi-modal attention mechanism, which simultaneously attends to multi-modal inputs and utilizes extracted local information to generate dialogue responses; in Chapter 3, we propose a fast face detection system for mobile devices, which detects salient facial parts and uses them as local cues for the detection of entire faces; in Chapter 4, we analyze fashion images by extracting representative features from local regions, which contain local fashion details of interest to humans; in Chapter 5, we design a fashion outfit compatibility prediction and interpretation system, which uses graphs to model fashion outfits and learns compatibility via advanced graph deep learning techniques; in Chapter 6, we conclude the current work and discuss future research directions.
Chapter 2
Generative Visual Dialogue System
with Attention
2.1 Introduction
Artificial Intelligence (AI) has witnessed rapid resurgence in recent years, due to many
innovations in deep learning. Exciting results have been obtained in computer vision
(e.g., image classification [93, 31], detection [86, 62, 132], etc.) as well as natural
language processing (NLP) (e.g., [114, 58, 133], etc.). Good progress has also been
made by researchers in vision-grounded NLP tasks such as image captioning [130, 53]
and visual question answering [3, 72]. Proposed recently, the Visual Dialogue (VD) [14]
task leads to a higher level of interaction between vision and language. In the VD task,
a machine conducts natural language dialogues with humans by answering questions
grounded in an image. It requires not only reasoning on vision and language, but also
generating consistent and natural dialogues.
Existing VD systems can be summarized into two tracks [14]: generative models
and discriminative models. The system adopting the generative model can generate
responses, while that using the discriminative model only chooses responses from a can-
didate set. Although discriminative models achieved better recall performance on the
benchmark dataset [14], they are not as applicable as generative models in real world
scenarios since candidate responses may not be available. In this work, we focus on the
design of generative VD systems for broader usage.
Figure 2.1: (a) An example from the VisDial dataset, and (b) comparison between MLE, GAN and WLE, where positive responses are highlighted in blue. The MLE-based generator learns from data in positive answers only. The GAN-based generator learns from data in negative answers indirectly through discriminators. Our WLE-based generator learns from data in both positive and negative answers.
One main weakness of existing generative models trained by the maximum likelihood estimation (MLE) method is that they tend to provide frequent and generic responses like 'Don't know'. This happens because the MLE training paradigm latches on to frequent generic responses [68]. They may match well with some questions but poorly with others. Since there are many possible paths a dialogue may take in the future, penalizing generic poor responses can eliminate such candidate dialogue paths and avoid the abuse of frequent responses. This helps bridge the large performance gap between generative and discriminative VD systems.
To reach this goal, we propose a novel weighted likelihood estimation (WLE) based
training scheme. Specifically, instead of assigning equal weights to each training sample
as done in the MLE, we assign a different weight to each training sample. The weight
of a training sample is determined by its positive response as well as the negative ones.
By incorporating supervision from both positive and negative responses, we enhance
answer diversity in the resulting generative model. The proposed training scheme is
effective in boosting the VD performance and easy to implement.
Another challenge for VD systems is effective reasoning based on multi-modal
inputs. Previous work pre-defined a set of reasoning paths based on multi-modal inputs.
The path is specified by a certain sequential processing order, e.g., human queries fol-
lowed by the dialogue history and then followed by image analysis [68]. Such a pre-
defined order is not capable of handling different dialogue scenarios, e.g., answering a
follow-up question of ‘Is there anything else on the table?’. We believe that a good rea-
soning strategy should determine the processing order by itself. Here, we propose a new
reasoning module, where an adaptive reasoning path accommodates different dialogue
scenarios automatically.
There are three major contributions of this work. First, an effective training scheme
for the generative VD system is proposed, which directly exploits both positive and
negative responses using an unprecedented likelihood estimation method. Second,
we design an adaptive reasoning scheme with unconstrained attention on multi-modal
inputs to accommodate different dialogue scenarios automatically. Third, our results
demonstrate the state-of-the-art performance on the VisDial dataset [14]. Specifi-
cally, our model outperforms the best previous generative-model-based method [116]
by 3.06%, 5.81% and 5.28 with respect to the recall@5, the recall@10 and the mean
rank performance metrics, respectively.
2.2 Related Work
Visual dialogue. Different visual dialogue tasks have been examined recently. The
VisDial dataset [14] is collected from free-form human dialogues with a goal to answer
questions related to a given image. The GuessWhat task [15] is a guessing game with
goal-driven dialogues so as to identify a certain object in a given image by asking yes/no
questions. In this work, we focus on the VisDial task. Most previous research on the
VisDial task follows the encoder-decoder framework in [98]. Exploration on encoder
models includes late fusion [14], hierarchical recurrent network [14], memory net-
work [14], history-conditioned image attentive encoder (HCIAE) [68], and sequential
co-attention (CoAtt) [116]. Decoder models can be classified into two types: (a) Dis-
criminative decoders rank candidate responses using cross-entropy loss [14] or n-pair
loss [68]; (b) Generative decoders yield responses using MLE [14], which can be fur-
ther combined with adversarial training [68, 116]. The latter involves a discriminator
trained on both positive and negative responses, and its discriminative power is then
transferred to the generator via auxiliary adversarial training.
Weighted likelihood estimation. Being distinct from previous generative work that
uses either MLE or adversarial training, we use WLE and develop a new training scheme
for VD systems in this work. WLE has been utilized for different purposes. For example,
it was introduced in [113] to remove the first-order bias in MLE. Smaller weights are
assigned to outliers for training to reduce the effect of outliers [80]. The binary indicator
function and the similarity scores are compared for weighting the likelihood in visual
question answering (VQA) in [35]. We design a novel weighted likelihood remotely
related to these concepts, to utilize both positive and negative responses.
Hard example mining. Hard example mining methods are frequently seen in object detection algorithms, where the number of background samples is much larger than that of object samples. In [87], the proposed face detector is alternately trained until convergence on sub-datasets and applied to more data to mine the hard examples. Online hard example mining is favored by later work [89, 62], where the softmax-based cross entropy loss is used to determine the difficulty of samples. We adopt the concept of sample difficulty and propose a novel way to find hard examples without requiring a softmax-based cross entropy loss.
Multi-modal reasoning. Multi-modal reasoning involves extracting and combining
useful information from multi-modal inputs. It is widely used in the intersection of
vision and language, such as image captioning [119] and VQA [118]. For the VD task,
reasoning can be applied to images (I), questions (Q) and dialogue history (H). In [68], the reasoning path adopts the order "Q → H → I". This order is further refined to "Q → I → H → Q" in [116]. In the recent arXiv paper [24], the reasoning sequence "Q → I → H" recurs to solve complicated problems. Unlike previous work that defines the reasoning path order a priori, we propose an adaptive reasoning scheme with no pre-defined reasoning order.
2.3 Proposed Generative VD System
In this section, we describe our approach to construct and train the proposed
generative visual dialogue system. Following the problem formulation in [14],
the input consists of an image $I$, a 'ground-truth' dialogue history $H_{t-1} = (\underbrace{C}_{h_0}, \underbrace{(Q_1, A_1)}_{h_1}, \ldots, \underbrace{(Q_{t-1}, A_{t-1})}_{h_{t-1}})$ with image caption $C$, and a follow-up question $Q_t$ at round $t$. $N$ candidate responses $\mathcal{A}_t = \{A_t^1, A_t^2, \ldots, A_t^N\}$ are provided for both training and testing. Figure 2.1(a) shows an example from VisDial [14].
We adopt the encoder-decoder framework [98]. Our proposed encoder, which involves an adaptive multi-modal reasoning module without a pre-defined order, will be described in detail in Sec. 2.3.1. The generative decoder receives the embedding of the input triplet $\{I, H_{t-1}, Q_t\}$ from the encoder and outputs a response sequence $\hat{A}_t$. Our VD system is trained using a novel training scheme with weighted likelihood estimation, which will be described in detail in Sec. 2.3.2.
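For concreteness, the following is a minimal sketch of how one dialogue round under this formulation can be represented as a data structure; the class and field names are illustrative and are not taken from the VisDial API.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class VisDialRound:
    image: Any                      # image I (or its pre-extracted CNN feature)
    caption: str                    # caption C, serving as history element h_0
    history: List[Tuple[str, str]]  # past (question, answer) pairs h_1 .. h_{t-1}
    question: str                   # follow-up question Q_t at round t
    candidates: List[str]           # N candidate responses A_t^1 .. A_t^N
    gt_index: int                   # index of the human (positive) response
```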
2.3.1 Adaptive Multi-modal Reasoning (AMR)
To conduct reasoning on multi-modal inputs, we first extract image features $F_I \in \mathbb{R}^{N \times H \times W}$ by a convolutional neural network, where $N$ is the length of the feature, and $H$ and $W$ are the height and width of the output feature map. The question feature $F_Q \in \mathbb{R}^{N \times l_Q}$ and history feature $F_H \in \mathbb{R}^{N \times l_H}$ are obtained by recurrent neural networks, where $l_Q$ and $l_H$ are the lengths of the question and the history, respectively.
Our reasoning path consists of two main steps, namely the comprehension step and
the exploration step, in a recurrent manner. In the comprehension step, useful informa-
tion from each input modality is extracted. It is apparent that not all the input infor-
mation is equally important in the conversation. Attention mechanism is thus useful to
extract relevant information. In the exploration step, the relevant information is pro-
cessed and the following attention direction is determined accordingly. Along the rea-
soning path, these two steps are performed alternately.
In [68, 116], the comprehension and exploration steps are merged together. The reasoning scheme focuses on a single input modality at a time and follows a pre-defined reasoning sequence through the input modalities. However, this pre-defined order cannot accommodate the various dialogue scenarios in the real world. For example, a question of "How many people are there in the image?" should yield a short reasoning sequence like

question (the word 'people') → image (regions of people),

whereas a question of "Is there anything else on the table?" should result in a long reasoning sequence such as

question (the word 'table') → image (regions of table) → question (the word 'else') → history (context for 'else').
To overcome the drawback of pre-defined reasoning sequence, we propose an adap-
tive multi-modal reasoning module as illustrated in Figure 2.2.
Let $\ast$ denote any multi-modal feature type (image, question or history), and let $F_\ast \in \mathbb{R}^{N \times M}$ denote the features to be attended, where $M$ is the number of features. The guided attention operation, which pays attention according to a given guide, is denoted as $f_\ast = \mathrm{GuidedAtt}(F_\ast, f_g)$, where $f_g \in \mathbb{R}^{N \times 1}$ is the attention guiding feature. The guided attention can be expressed as:

$$E_\ast = \tanh(W_\ast F_\ast + W_g f_g \mathbf{1}^T), \qquad (2.1)$$
$$a_\ast = \mathrm{softmax}(E_\ast^T w_{att}), \qquad (2.2)$$
$$f_\ast = F_\ast a_\ast, \qquad (2.3)$$

where $W_\ast$, $W_g$ and $w_{att}$ are learnable weights, and $\mathbf{1}$ is a vector with all elements set to 1.
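A minimal PyTorch sketch of the guided attention block in Eqs. (2.1)-(2.3) is given below; the batch-first tensor layout and the attention hidden size `d_att` are assumptions made for illustration, not details taken from the dissertation.

```python
import torch
import torch.nn as nn

class GuidedAtt(nn.Module):
    """Guided attention of Eqs. (2.1)-(2.3): attend to a feature set given a guide f_g."""
    def __init__(self, feat_dim: int, d_att: int = 512):
        super().__init__()
        self.W = nn.Linear(feat_dim, d_att, bias=False)      # W_*   in Eq. (2.1)
        self.W_g = nn.Linear(feat_dim, d_att, bias=False)    # W_g   in Eq. (2.1)
        self.w_att = nn.Linear(d_att, 1, bias=False)         # w_att in Eq. (2.2)

    def forward(self, F_mod: torch.Tensor, f_g: torch.Tensor) -> torch.Tensor:
        # F_mod: (B, M, N) features to be attended; f_g: (B, N) attention guide.
        E = torch.tanh(self.W(F_mod) + self.W_g(f_g).unsqueeze(1))  # (B, M, d_att), Eq. (2.1)
        a = torch.softmax(self.w_att(E).squeeze(-1), dim=-1)        # (B, M), Eq. (2.2)
        f = torch.bmm(a.unsqueeze(1), F_mod).squeeze(1)             # (B, N), Eq. (2.3)
        return f
```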
In time step $i$, the image features $F_I$, the question features $F_Q$ and the history features $F_H$ are attended separately by their own guided attention blocks. During the comprehension step, the outputs of the guided attention blocks $f_{I,i}$, $f_{Q,i}$ and $f_{H,i}$, i.e., the extracted information from each modality, are merged into $f_{QIH,i}$. During the exploration step, the merged vector is processed in the reasoning RNN block, which generates the new attention guiding feature $f_{g,i}$ to guide the attention in time step $i+1$. The final embedding feature $E$ is

$$E = \tanh(W f_{QIH, i_{max}}), \qquad (2.4)$$

where $W$ is a learnable weight matrix and $i_{max}$ is the maximum number of recurrent steps.

Figure 2.2: The adaptive multi-modal reasoning.
Through this mechanism, the reasoning RNN block maintains a global view of the
multi-modal features and reasons what information should be extracted in the next time
step. The information extraction order and subject are therefore determined adaptively
along the reasoning path.
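A minimal sketch of this recurrent comprehension/exploration loop is shown below, reusing the GuidedAtt module from the previous sketch; the concatenation-plus-linear fusion and the GRU cell standing in for the reasoning RNN are illustrative choices rather than the exact design.

```python
import torch
import torch.nn as nn

class AMR(nn.Module):
    """Adaptive multi-modal reasoning: alternate comprehension and exploration steps."""
    def __init__(self, feat_dim: int, i_max: int = 2):
        super().__init__()
        self.i_max = i_max
        self.att_I = GuidedAtt(feat_dim)                  # image attention
        self.att_Q = GuidedAtt(feat_dim)                  # question attention
        self.att_H = GuidedAtt(feat_dim)                  # history attention
        self.merge = nn.Linear(3 * feat_dim, feat_dim)    # f_I, f_Q, f_H -> f_QIH
        self.reason_rnn = nn.GRUCell(feat_dim, feat_dim)  # reasoning RNN (exploration)
        self.W = nn.Linear(feat_dim, feat_dim, bias=False)

    def forward(self, F_I, F_Q, F_H):
        # Each F_* has shape (B, M_*, N); the attention guide starts from zeros.
        B, _, N = F_Q.shape
        f_g = F_Q.new_zeros(B, N)
        for _ in range(self.i_max):
            # Comprehension: extract relevant information from every modality at once.
            f_I, f_Q, f_H = self.att_I(F_I, f_g), self.att_Q(F_Q, f_g), self.att_H(F_H, f_g)
            f_QIH = self.merge(torch.cat([f_I, f_Q, f_H], dim=-1))
            # Exploration: decide what information to attend to in the next time step.
            f_g = self.reason_rnn(f_QIH, f_g)
        return torch.tanh(self.W(f_QIH))                  # final embedding E, Eq. (2.4)
```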
2.3.2 WLE-based Training Scheme
As the discriminative VD models are trained to differentiate positive and negative
responses, they perform better on the standard discriminative benchmark. In contrast,
the generative visual dialogue models are trained to only maximize the likelihood of
positive responses. The MLE loss function is expressed as:
$$L_{MLE} = -\sum_m \log(p^{pos}_m), \qquad (2.5)$$

where $p^{pos}_m$ denotes the estimated likelihood of the positive response of sample $m$. There is only one positive response per sample provided for training in the VisDial task. However, since there are many possible paths a dialogue may take in the future, the MLE approach favors the frequent and generic responses when the training data is limited
[68]. In the VisDial task, negative responses are selected from positive responses to
other questions, including frequent and generic responses. Incorporating the negative
responses to maximize the learning from all available information is thus essential to
improve the generative models.
We propose a WLE based training scheme to utilize the negative responses and rem-
edy the bias of MLE. Rather than treating each sample with equal importance, we assign
a weight $\alpha_m$ to each estimated log-likelihood:

$$L_{WLE} = -\sum_m \alpha_m \log(p^{pos}_m). \qquad (2.6)$$
We can interpret the weighted likelihood as a hard sample mining process. We are
inspired by OHEM [89] and focal loss [62] designed for object detection, where hard
samples are mined using their loss values and receive extra attention. Rather than using
the preliminary softmax cross entropy loss for discriminative learning, we propose to
use likelihood estimation to mine the hard samples. If the current model cannot predict
the likelihood for a sample well, it indicates that this sample is hard for the model. Then
we should increase the weight for this hard sample and vice versa.
Given both positive and negative responses for training, we propose to assign the weights as:

$$\alpha_{m,n} = 1 - \frac{\log(p^{neg}_{m,n})}{\log(p^{pos}_m)}, \qquad (2.7)$$
$$\tilde{\alpha}_m = \exp\big(\beta \max_n(\alpha_{m,n})\big), \qquad (2.8)$$
$$\alpha_m = \max(\tilde{\alpha}_m, \gamma), \qquad (2.9)$$

where $p^{neg}_{m,n}$ denotes the estimated likelihood of the $n$-th negative response of sample $m$, and $\beta$ and $\gamma$ are hyper-parameters to shape the weights.
We can also view the proposed loss function as a ranking loss. We assign a weight to a sample by comparing the estimated likelihoods of its positive and negative responses. $\alpha_{m,n}$ measures the relative distance in likelihood between the positive response and the $n$-th negative response of sample $m$. If the likelihood of a positive response is low compared to the negative responses, we should penalize more by increasing the weight for this sample. If the estimated likelihood of a positive sample is already very high, we should lower its weight to reduce the penalization.
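A minimal PyTorch sketch of this weighting scheme (Eqs. (2.6)-(2.9)) is given below; `beta` and `gamma` stand in for the weight-shaping hyper-parameters, and detaching the weights from the computation graph and summing over the mini-batch are illustrative choices.

```python
import torch

def wle_loss(log_p_pos: torch.Tensor, log_p_neg: torch.Tensor,
             beta: float = 1.0, gamma: float = 1.0) -> torch.Tensor:
    """log_p_pos: (B,) log-likelihoods of the positive responses.
    log_p_neg: (B, K) log-likelihoods of K negative responses per sample."""
    # Eq. (2.7): relative likelihood distance between the positive and each negative response.
    alpha_mn = 1.0 - log_p_neg / log_p_pos.unsqueeze(1)
    # Eqs. (2.8)-(2.9): shape the per-sample weight from the hardest negative, with a floor.
    alpha_m = torch.exp(beta * alpha_mn.max(dim=1).values).clamp(min=gamma)
    # Eq. (2.6): weighted negative log-likelihood over the mini-batch.
    return -(alpha_m.detach() * log_p_pos).sum()
```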
Model MRR R@1 R@5 R@10 Mean
LF [14] 0.5199 41.83 61.78 67.59 17.07
HREA [14] 0.5242 42.28 62.33 68.17 16.79
MN [14] 0.5259 42.29 62.85 68.88 17.06
HCIAE [68] 0.5467 44.35 65.28 71.55 14.23
FlipDial [73] 0.4549 34.08 56.18 61.11 20.38
CoAtt [116] 0.5578 46.10 65.69 71.74 14.43
Coref [52] 0.5350 43.66 63.54 69.93 15.69
Ours 0.5614 44.49 68.75 77.55 9.15
Table 2.1: Performance of generative models on VisDial 0.9. ‘Mean’ denotes mean
rank, for which lower is better. All the models use VGG as backbone except for Coref
which uses ResNet.
2.4 Experiments
2.4.1 Dataset
We evaluate our proposed model on the VisDial dataset [14]. In VisDial v0.9, on which
most previous work has benchmarked, there are in total 83k and 40k dialogues on
COCO-train and COCO-val images, respectively. We follow the methodology in [68]
and split the data into 82k for train, 1k for val and 40k for test. In the new version, VisDial v1.0, which was used for the Visual Dialog Challenge 2018, train consists of the previous 123k images and corresponding dialogues. 2k and 8k images with dialogues are collected for val and test, respectively.
Each question is supplemented with 100 candidate responses, among which only one
is the human response for this question. Following the evaluation protocol in [14], we
rank the 100 candidate responses by their estimated likelihood and evaluate the models
using standard retrieval metrics: (1) mean rank of the human response, (2) recall rate of
the human response in the top-k ranked responses for k = 1, 5, 10, (3) mean reciprocal rank
(MRR) of the human response, (4) normalized discounted cumulative gain (NDCG) of
all correct responses (only available for v1.0).
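The sketch below computes these retrieval metrics from the candidate scores, assuming `scores` holds the model's estimated likelihood for each of the 100 candidates per question and `gt` holds the index of the human response; NDCG is omitted since it requires per-candidate relevance annotations.

```python
import numpy as np

def retrieval_metrics(scores: np.ndarray, gt: np.ndarray) -> dict:
    """scores: (num_questions, 100) candidate likelihoods; gt: (num_questions,) human-answer index."""
    order = np.argsort(-scores, axis=1)                  # candidates sorted by decreasing likelihood
    ranks = np.argwhere(order == gt[:, None])[:, 1] + 1  # 1-based rank of the human answer
    return {
        "mean_rank": ranks.mean(),
        "mrr": (1.0 / ranks).mean(),
        **{f"recall@{k}": (ranks <= k).mean() for k in (1, 5, 10)},
    }
```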
[Figure 2.3 contents: four example questions, each shown with the human response and our model's top-10 ranked candidate responses.]
Figure 2.3: Examples of top-10 responses ranked by our model. When there are mul-
tiple correct responses to the question, our model may choose other candidates that are
semantically similar to the human response. The human responses are highlighted in
blue.
2.4.2 Implementation Details
We follow the procedures in [68] to pre-process the data. The captions, questions and
answers are truncated at 24, 16 and 8 words for VisDial v0.9, and 40, 20 and 20 words
for VisDial v1.0. Vocabularies are built afterwards from the words that occur at least five times in train. We use 512D word embeddings, which are trained from scratch
and shared by question, dialogue history and decoder LSTMs.
For a fair comparison with previous work, we adopt the simple LSTM decoder with
a softmax output which models the likelihood of the next word given the embedding
feature and the previously generated sequence. We also set all LSTMs to have a single layer with a 512D hidden state for consistency with other works. We extract image features
from pre-trained CNN models (VGG [93] for VisDial v0.9, ResNet [31] or bottom-up features [2] for VisDial v1.0), and train the rest of our model from scratch. We use the Adam optimizer with a base learning rate of $4 \times 10^{-4}$.

[Figure 2.4 data: NDCG scores of the top-10 teams (ranks 1-10): 57.75, 56.45, 55.38, 55.33, 54.31, 52.65, 50.27, 48.76, 47.79, 45.78.]
Figure 2.4: Results of the top-10 teams in the first visual dialog challenge. As the only team in the top 10 using a generative visual dialogue system, we are ranked in 6th place (highlighted in gray). Our NDCG score is comparable with other discriminative systems.

Model MRR R@1 R@5 R@10 Mean
MN [14] 0.4799 38.18 57.54 64.32 18.60
HCIAE [68] 0.4910 39.35 58.49 64.70 18.46
CoAtt [116] 0.4925 39.66 58.83 65.38 18.15
ReDAN [24] 0.4969 40.19 59.35 66.06 17.92
Ours 0.5015 38.26 62.54 72.79 10.71
Table 2.2: Performance of generative models on VisDial v1.0 val. Results of previous work are reported by ReDAN.
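A minimal PyTorch sketch of this decoder and optimizer setup is shown below (512-D shared word embeddings, a single-layer 512-D LSTM, a softmax over the vocabulary, and Adam with a base learning rate of 4e-4); conditioning the decoder through its initial hidden state is an illustrative choice, and the rest of the model is omitted.

```python
import torch
import torch.nn as nn

class LSTMDecoder(nn.Module):
    """Single-layer LSTM decoder with a softmax output over the vocabulary."""
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)                     # 512-D word embeddings
        self.lstm = nn.LSTM(dim, dim, num_layers=1, batch_first=True)  # single 512-D layer
        self.out = nn.Linear(dim, vocab_size)                          # next-word logits (softmax in the loss)

    def forward(self, tokens: torch.Tensor, enc_embedding: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) word indices; enc_embedding: (B, dim) embedding E from the encoder.
        h0 = enc_embedding.unsqueeze(0)          # condition via the initial hidden state
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)                  # (B, T, vocab_size) next-word logits

# decoder = LSTMDecoder(vocab_size=10000)
# optimizer = torch.optim.Adam(decoder.parameters(), lr=4e-4)
```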
2.4.3 Experiments Results and Analysis
Baselines
We compare our proposed model to several baselines and the state-of-the-art genera-
tive models. In [14], three types of encoders are introduced. Late Fusion (LF) extracts
features from each input separately and fuses them at a later stage. Hierarchical Recurrent Encoder (HRE) uses a hierarchical recurrent encoder for the dialogue history, and HREA adds attention to the dialogue history on top of the hierarchical recurrent encoder. Memory Network (MN) uses a memory bank to store the dialogue history and finds the corresponding memories to answer the question. History-Conditioned Image Attentive Encoder (HCIAE) is proposed in [68] to attend on image and dialogue history and is trained with generative adversarial training (GAN). Another concurrent work with GAN [116] proposes a co-attention model (CoAtt) that attends to question, image and dialogue history. FlipDial [73] uses a VAE for sequence generation. We also compare with a neural module network approach, Coref [52], for which only the performance with a ResNet [31] backbone is reported. ReDAN [24] is a recently proposed method that involves a multi-step reasoning path with a pre-defined order.

Model MRR R@1 R@5 R@10 Mean
HCIAE-MLE 0.5386 44.06 63.55 69.24 16.01
HCIAE-GAN 0.5467 44.35 65.28 71.55 14.23
HCIAE-WLE 0.5494 43.43 66.88 75.59 9.93
AMR-MLE 0.5403 44.17 63.86 69.67 15.49
AMR-WLE 0.5614 44.49 68.75 77.55 9.15
Model MRR R@1 R@5 R@10 Mean
HCIAE-MLE — — — — —
HCIAE-GAN +0.0081 +0.29 +1.73 +2.31 -1.78
HCIAE-WLE +0.0108 -0.92 +3.33 +6.35 -6.08
AMR-MLE — — — — —
AMR-WLE +0.0211 +0.32 +4.89 +7.88 -6.34
Table 2.3: Ablation study on VisDial 0.9. Top: absolute values. Bottom: improvement from MLE models.
Results on VisDial v0.9
Table 2.1 compares our results to other reported generative baselines. Our model performs the best on most of the evaluation metrics. Compared to HCIAE [68], our model shows comparable performance on R@1, and outperforms it on MRR, R@5, R@10 and mean rank by 1.47%, 3.47%, 6% and 5.08, respectively. Our model also outperforms CoAtt [116], which achieved the previous best results for generative models. Our results surpass it by large margins on R@5, R@10 and mean rank: 3.06%, 5.81% and 5.28, respectively.
While our model demonstrates remarkable improvement on R@5, R@10 and mean
rank, MRR shows moderate gain while R@1 is slightly behind. We attribute this to
the fact that there could be more than one correct response among the candidates while
only one is provided as the correct answer. As demonstrated by the examples of top-10
responses in Figure 2.3, our model is capable of ranking multiple correct answers to
higher places. However, the single human answer is not necessarily ranked the 1st, thus
greatly affecting R@1.
Results on VisDial v1.0
In the Visual Dialog Challenge 2018, all correct responses in test are annotated by
humans and considered in the evaluation. Figure 2.4 presents the top-10 results. Our model, as the only generative model in the top 10, is ranked in 6th place among those discriminative models. This also verifies our claim that our low R@1 score on v0.9 is because the evaluation only considers the human response but ignores all other correct responses. We used ResNet features for the challenge. Since ReDAN only reports its generative performance on VisDial v1.0 val with bottom-up features, we also present our results using the same setting in Table 2.2. We list the results of previous work in Table 2.2 as reported in [24]. Similar to the results on VisDial v0.9, our proposed method outperforms previous methods on MRR, R@5, R@10 and Mean.

[Figure 2.5 panels: questions "What color is the airplane?" and "Can you see any buildings?", shown at reasoning time steps i = 1 and i = 2.]
Figure 2.5: Visualization of image attention heatmaps for different questions and reasoning steps. Regions of attention are highlighted in blue.
Ablation Study
Our model contains two main novel components, namely the adaptive multi-modal rea-
soning module and the WLE based training scheme. To verify the contribution of each
component, we compare the following models: (a) HCIAE-MLE is the HCIAE model
trained via MLE; (b) HCIAE-GAN is the HCIAE model trained via MLE and GAN; (c) HCIAE-WLE is the HCIAE model trained via WLE; (d) AMR-MLE is our AMR model trained via MLE; (e) AMR-WLE is our final model with both key components.

[Figure 2.6 contents: four example dialogues shown with columns for images, questions, human responses, MLE responses and WLE responses.]
Figure 2.6: Qualitative results on test. The questions and answers are truncated at 16 and 8 words, respectively, the same as in our data pre-processing.
The results are listed in Table 2.3. The effectiveness of the proposed reasoning
scheme is demonstrated in the HCIAE-MLE vs. AMR-MLE and HCIAE-WLE vs.
AMR-WLE comparisons where our model outperforms HCIAE on all metrics. The
importance of our proposed WLE is highlighted in the comparison between HCIAE-
WLE and HCIAE-GAN. HCIAE-WLE performs better on all metrics. Specifically, the
improvement on the HCIAE model brought by WLE is more than twice that brought by GAN on R@10 (6.35 vs. 2.31) and mean rank (6.08 vs. 1.78). Our proposed training scheme is therefore also compatible and effective with other encoders.
Qualitative Results
Examples of image attention heatmaps are visualized in Figure 2.5, which demonstrates how the adaptive reasoning focus changes across questions and reasoning time steps. For example, for the second question, the attention on the image was first spread over a large background area and then moved to a more focused region to answer the question 'any buildings'.
Figure 2.6 shows some qualitative results on the test split. Our generative model is able to generate more non-generic answers. As evident in the comparison between MLE and WLE, the WLE results are more specific and human-like.
2.5 Conclusion
In this work, we have presented a novel generative visual dialogue system. It involves
an adaptive reasoning module for multi-modal inputs. The proposed reasoning module
does not have any pre-defined sequential reasoning order and can accommodate various
dialogue scenarios. The system is trained using weighted likelihood estimation, for which we design a new training scheme tailored to generative visual dialogue.
Chapter 3
Leveraging Local Cues for Face
Detection on Mobile Devices
3.1 Introduction
Increased and extensive usage of cameras in mobile devices, such as smartphones and
drones, has spawned a wide range of applications. This is especially true for face-
related applications. Face detection is an important prerequisite of many face related
applications, including face recognition [88], face alignment [134], face editing [94],
face manipulation [135] and tracking [49].
Face detection has been extensively studied for several decades. Satisfactory perfor-
mance is achieved on high-performance computers nowadays. With the swift advances
in processing power and memory, real-time video processing for computer vision tasks
in mobile devices [97] is within reach. Traditional computer vision algorithms such
as the Viola-Jones method [109], the deformable part model [74] and their extensions
have already offered good performance in context-constrained environments. However,
they fail to handle images in unconstrained environments due to the large variation of
poses, resolutions, illumination and occlusion. Inspired by the success of the convo-
lutional neural network (CNN) in image classification [54] and object detection [86],
CNN-based face detection methods have achieved remarkable performance even with
occlusion, pose variations and lighting changes. One of the remaining challenges is
efficient detection of scale-variant faces.
Most CNN-based methods address the scale-invariance problem with one of two strategies. The first one is to decompose an image into different pyramid levels and, then, feed the image pyramid to the network, e.g., [57, 85, 134, 18]. These methods are time-consuming since an input image has to be rescaled up and down many times, and the network inference has to be performed on all rescaled images. The second one is to build a deeper and larger network where features at different layers are utilized to achieve scale invariance. Although it operates on a single input image, it requires a larger model size and heavier computation.
To adapt CNN-based methods from a high-performance platform to mobile devices,
we cannot afford larger and deeper networks. In this work, we investigate the use of a CNN cascade as proposed in [57], which consists of several light-weight
networks. The first network of the CNN cascade serves as a proposal generation module
that scans through the whole image quickly to obtain face candidates. However, the
proposal network can only detect faces of sizes in a narrow range. If the face regions
exceed its receptive field, it will fail to capture the global characteristics. One possible
solution is to rescale the image to different levels and scan each rescaled image using
the same network repetitively. However, this is highly inefficient in computation.
To address the above-mentioned shortcoming, we improve this proposal network by
utilizing both global and local facial characteristics so that the resulting network has the
multi-scale capability in proposal generation. Furthermore, for face regions that exceed
the processing size, we use captured local facial characteristics as cues to infer the face
location. Consequently, face regions with multiple sizes can be found in a single forward
pass. In this way, fewer pyramid levels are required to locate faces at different scales.
Finally, we speed up the proposed model using the model acceleration technique [56]
and reduce its training time by adding an auxiliary loss function term.
The main contributions of this work can be summarized below.
• We introduce a method for face proposals by capturing both global and local facial characteristics and, therefore, reduce the number of levels of the input image pyramid.
• We propose a face detector that has high accuracy while meeting the crucial memory and speed requirements of mobile devices.
• We present a method to quickly infer the location of face regions using local facial characteristics.
• We improve the training time by adding an auxiliary loss function term for model acceleration.
3.2 Review of Related Work
3.2.1 Traditional Face Detectors
Early work on face detection mainly relies on hand-crafted features followed by the
classification task of certain classifiers. The Viola-Jones detector [109] adopts the Haar
features and the AdaBoost classifier. It is still a popular method nowadays due to its
small model size and fast speed. Detectors based on the deformable part model (DPM)
[74, 136] define a face as a collection of its parts. Then, a latent support vector machine
(SVM) is used as a classifier that finds parts of human faces and their geometric relation-
ship. Later, sparse coding features [96, 43] are extracted for performance improvement.
Although DPM-based methods can achieve remarkable performance, they are computationally expensive and sensitive to hand-crafted features. Multiple visual cues, including texture and stereo disparity, are combined in [42] to improve detection performance and reduce computational complexity. In [44], features used for face detection and localization are classified into different groups. During detection, the information of each group is extracted separately and, then, integrated using the constraint relationships to improve localization accuracy. Recently, a boosted-decision-tree-based face detector [82] outperformed all other non-CNN techniques, and it operates at a fast speed. On the other hand, unlike CNN-based models, the boosted-tree model does not have additional modeling capacity, so it does not benefit as much from a large amount of training data [82].
3.2.2 CNN-based Face Detectors
Many CNN-based methods have been proposed in recent years. They achieve better performance due to their powerful discriminative capability. In [21], a pre-trained AlexNet [54] is fine-tuned and converted to a fully-convolutional structure to fit different input sizes. The feature map can be directly used as a heatmap to localize faces. In [110], a CNN is designed for multiple tasks to improve face detection accuracy, where a multi-task loss is used to classify faces/non-faces and regress face bounding boxes simultaneously.
Research on effective face detection in mobile device applications has received a
lot of attention in recent years. The CNN cascade [134, 57, 85, 18] offers an attractive
solution due to its small model size and fast speed. It consists of several stages, and each stage is a CNN-based binary classifier that labels an input patch as face or non-face. From the first stage to the last stage, the network grows deeper and becomes more discriminative to address challenging false positives. During detection,
the first stage quickly scans the test image and obtains candidate facial windows. All
candidates will be passed to the next stage which further reduces the number of false
candidates. In this way, most of the patches are eliminated by shallower networks before
reaching the last network, which is faster than feeding the entire image into a deep
network directly. Instead of training each network in the cascade separately, a joint
training framework is proposed in [85]. To further improve the accuracy of a cascade, face detection and alignment are jointly learned in [134]. It achieves the highest
accuracy and speed among all CNN cascades. The nested soft-cascade is introduced
in [18], where each shallow localization network is trained on different data and then
assembled in a soft-cascade fashion.
Face images need to be scaled up/down to form an image pyramid so as to accommo-
date various face sizes. Since the inference step is repeated for each level of the image
pyramid, the running time increases significantly. To address this problem, branches at different scales are added to reduce the number of image pyramid levels or to abandon the pyramid entirely. Multi-scale branches are added to the end of the network so as to reduce the image pyramid to octave-spaced scales in [4]. Multiple proposal networks are utilized in [125] to avoid
the image pyramid. Furthermore, for each proposal network, a corresponding detec-
tion network is applied to its proposals. Significant improvement in tiny face detection
is achieved in [36] by training separate detectors and defining multiple templates for
different scales.
Other methods tackle the multi-scale problem by combining multiple large networks.
Building larger or multiple networks with different scales will improve the accuracy but
also induce extra memory and computational cost. Heatmaps of facial parts are obtained
from five different networks and combined into a single heatmap in [124]. The faceness
measure of a candidate bounding box is calculated based on the geometry of each part.
The face proposals are then refined by the fine-tuned AlexNet [54]. An encoder-decoder
network is used in [111] to detect facial landmarks and extract features. Feature maps
are obtained from different layers of the encoder-decoder network and feature maps
that have the same dimension are concatenated together. A sliding window method
is applied to each concatenated feature map. Afterwards, the RoI pooling [86] maps
features in each window region to a fixed feature vector for classification. A scale-
aware framework is adopted in [30], where possible face sizes are first estimated by a scale proposal network. Then, the image is re-sampled to the corresponding scales and fed into a single-scale detector. The number of image pyramid levels can be significantly reduced if an image only contains faces of a few scales. However, the model size and the associated computational complexity are still not suitable for mobile devices.

Figure 3.1: Comparison between our method and a typical CNN cascade framework (TCCF). The main difference is that local characteristics such as eyes, nose, mouth, etc. are captured in our method so that a single level of the pyramid can encode multiple scales of faces (M pyramid levels versus N for the TCCF, with M < N). The number of pyramid levels is thus reduced to speed up the proposal generation process.
In this work, we adopt the CNN cascade as the baseline for its small model size
and, then, propose a new training architecture to accelerate the detection speed while
preserving detection accuracy.
33
3.3 Proposed Method using Local Facial Characteris-
tics
3.3.1 Proposal Network
Although a typical CNN cascade framework (TCCF) is faster than other deep learning structures, it is still not easy to implement on mobile devices for real-time face detection. The main bottleneck lies in the first stage, which serves as a proposal network and has to process every level of the image pyramid. In the TCCF, an image has to be resampled to a proper size to ensure that the face region matches the receptive field of the proposal network (12×12 is used in [134, 57, 85, 18]). Each pyramid level corresponds to a specific scale of a face. As a result, we need to form a dense image pyramid
in order to achieve high detection accuracy. One way to save the computation time is
to reduce the number of pyramid levels at the expense of detection accuracy. If we
can encode more scales in one pyramid level, fewer levels will be needed. Then, the
proposal generation process will be accelerated.
Based on the above discussion, we present a new proposal network to reduce the
total number of image pyramid levels. The new network captures not only the global
characteristics but also local cues of faces. When the global characteristics is captured,
the input patch will be fed to the next stage as a face proposal. On the other hand, when
local cues are captured, the location of the face is inferred and the corresponding region
is used as the face proposal.
A comparison between our method and the previous TCCF is shown in Fig. 3.1.
Our network needs to capture both global and local characteristics of faces. To achieve this objective, we design the network as a multi-label classifier that classifies an input patch as background, a global face or a facial part. Since we do not want to confuse the classifier with similar-looking regions such as the cheek and the forehead, we only choose the most distinctive facial parts such as the eye, the nose and the mouth.

Figure 3.2: Illustration of the proposal network. In the training stage, face and facial part patches are randomly cropped from training images and used to train a multi-label classification network. In the testing stage, a test image is resized to form a sparse pyramid and fed into the multi-label classification network to generate heatmaps of the face and facial parts (an M×N×8 map, with one channel for each facial part and the whole face). Based on the heatmaps and bounding box templates of each facial part, we can generate face proposals. These face proposals will be sent to the next stage of the CNN cascade.
Our proposal network is shown in Fig. 3.2 and the network parameters are given in Table 3.2. The input image resolution is 12×12, which is the same as that in previous TCCF schemes [134, 57, 85, 18]. Inspired by [134] and [21], we design the network to
be fully-convolutional so that it can be applied to images of any resolution. This avoids
cropping out patches with a sliding window. The stride of the whole network is set to 2.
When the network scans an input image in the test stage, it is equivalent to the use of a
sliding window of stride 2.
Our design includes eight categories: 1) background, 2) the whole face, 3) an eye, 4)
a nose, 5) the whole mouth, 6) the left corner of a mouth, 7) the right corner of a mouth,
and 8) an ear. It is worthwhile to point out that most existing face detection methods
do not exploit ears. Yet, we observe that ears provide good local characteristics. This is
especially true for side faces. During the detection, given heatmaps of the global face
and facial parts, we need to generate proposals in terms of bounding boxes. This will be
elaborated in the next subsection.
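For concreteness, the following is a minimal PyTorch sketch of a fully-convolutional proposal network matching the layer configuration in Table 3.2; the activation function and the ceil-mode pooling are our assumptions, since they are not specified in the table.

```python
import torch
import torch.nn as nn

class ProposalNet(nn.Module):
    """Sketch of the fully-convolutional proposal network following Table 3.2.
    Activation choice (PReLU) and ceil-mode pooling are assumptions."""
    def __init__(self, num_classes=8):  # 8 categories: background, face, eye, nose, ...
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),              # 12x12x3 -> 12x12x16
            nn.PReLU(16),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),   # -> 6x6x16 (network stride 2)
            nn.Conv2d(16, 32, kernel_size=3),                        # -> 4x4x32
            nn.PReLU(32),
            nn.Conv2d(32, 32, kernel_size=3),                        # -> 2x2x32
            nn.PReLU(32),
            nn.Conv2d(32, 64, kernel_size=2),                        # -> 1x1x64
            nn.PReLU(64),
            nn.Conv2d(64, num_classes, kernel_size=1),               # -> 1x1x8 class scores
        )

    def forward(self, x):
        # For a 12x12 input the output is 1x1x8; for a larger image the network
        # acts as a stride-2 sliding window and returns one heatmap per category.
        return self.features(x)

heatmaps = ProposalNet()(torch.randn(1, 3, 720, 1280))  # 1 x 8 x H' x W' heatmaps
```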
3.3.2 Proposal Generation
Fast proposal generation from parts is a non-trivial problem. Previous methods for detecting facial parts [74, 124, 111] are computationally intensive since they have to integrate facial parts with the global face via sliding windows. Several ideas are outlined below. Candidate windows can be obtained from generic object proposal generators and, then, a faceness measure is computed for each window for its evaluation in [124]. A sliding window is applied for RoI pooling and the features of each window are mapped to a fixed-size feature vector for classification in [111].
Different from these methods, our method aims at utilizing facial parts to reduce the number of image pyramid levels and speed up the detection process. It is time-consuming to evaluate each sliding window since the number of sliding windows is significantly larger than the number of faces. Furthermore, we do not want to spend extra time in generating generic object proposals. Thus, instead of using the candidate window
scheme and scoring each window, we propose to generate candidate bounding boxes
from the facial part heatmap directly. The bounding boxes are then combined and eval-
uated simultaneously.
The pipeline of our proposal generation process is shown in Fig. 3.3. It consists of
the following three steps.
1. Finding local maxima
For each facial part heatmap, we first apply a threshold, θ_p, to find the strong responses, where p denotes a certain facial part. The non-maximum suppression (NMS) scheme is then used to obtain the strongest response points in local regions of the heatmap.
Figure 3.3: The pipeline of our proposal generation process, where the eye and the mouth are adopted to illustrate facial parts.
2. Bounding box generation using templates
Bounding boxes are generated from the face heatmap and the facial part heatmaps separately. For the face heatmap, similar to [21], we apply a threshold, θ_f, to the heatmap and extract local maxima to generate bounding boxes. As to bounding boxes for each facial part, we define a bounding box template for each part separately. We define two templates for eyes so as to represent the left and right eyes. Each bounding box is determined by the coordinates of its upper-left vertex, (x_1, y_1), and its bottom-right vertex, (x_2, y_2). The i-th bounding box location is denoted as b_i = (x_{i1}, y_{i1}, x_{i2}, y_{i2}). For bounding box i, its score p_i is set equal to its corresponding value on the heatmap. In this way, we can infer the bounding box of the face from the local maxima detected in the previous step.
3. Part box combination
For bounding boxes generated by different facial parts, we employ a procedure similar to NMS to combine them. That is, given a set of bounding boxes, we start from the bounding box with the highest score and find all bounding boxes whose intersection over union (IoU) with it is higher than a threshold θ_IoU. These bounding boxes are merged by taking the average of their coordinates as
\[ b_{m,i} = \frac{1}{|C_i|} \sum_{j \in C_i} b_j, \quad \text{where } C_i = \{b_i\} \cup \{b_j : \mathrm{IoU}(b_i, b_j) > \theta_{\mathrm{IoU}}\}. \quad (3.1) \]
The score of the merged bounding box is set to
\[ p_{m,i} = 1 - \prod_{j \in C_i} (1 - p_j), \quad (3.2) \]
which resembles the rule for combining the probabilities of independent events. The merged bounding box is assigned to the proposal set while the bounding boxes used for merging are eliminated from the original set. We repeat the searching and merging process for the remaining bounding boxes until all bounding boxes are processed.

Table 3.1: Parameters of the proposal network.

Table 3.2: Proposal network without acceleration.
Layer   Kernel size   Output size
Input   -             12×12×3
Conv1   3×3           12×12×16
Pool1   3×3           6×6×16
Conv2   3×3           4×4×32
Conv3   3×3           2×2×32
Conv4   2×2           1×1×64
Conv5   1×1           1×1×8
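To make the combination step concrete, here is a minimal Python sketch of the greedy merging described by Eqs. (3.1) and (3.2); the box and score containers and the threshold value are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns intersection over union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_part_boxes(boxes, scores, iou_thresh=0.5):
    """Greedy merge following Eqs. (3.1)-(3.2): average the coordinates of
    overlapping boxes and combine their scores as independent events."""
    boxes, scores = list(map(np.asarray, boxes)), list(scores)
    proposals = []
    while boxes:
        i = int(np.argmax(scores))                       # highest-scoring box
        cluster = [j for j in range(len(boxes))
                   if j == i or iou(boxes[i], boxes[j]) > iou_thresh]
        merged_box = np.mean([boxes[j] for j in cluster], axis=0)          # Eq. (3.1)
        merged_score = 1.0 - np.prod([1.0 - scores[j] for j in cluster])   # Eq. (3.2)
        proposals.append((merged_box, merged_score))
        boxes = [b for j, b in enumerate(boxes) if j not in cluster]
        scores = [s for j, s in enumerate(scores) if j not in cluster]
    return proposals
```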
Table 3.3: Proposal network with acceleration.
Layer   Kernel size   Output size
Input   -             12×12×3
Conv1   3×3           6×6×16
Conv2   3×3           4×4×32
Conv3   3×3           2×2×32
Conv4   2×2           1×1×64
Conv5   1×1           1×1×8
Figure 3.4: Comparison of the standard and the accelerated training schemes for the proposal network. (The diagrams contrast the network before and after acceleration, showing conv2–conv5, the stride-2 max pooling, and the losses L_cls and L_conv.)
3.3.3 Model Acceleration
Inspired by [56], we accelerate the proposal network by merging non-tensor layers with their neighboring convolution units, since non-tensor layers demand a large amount of computational time on mobile devices. The max-pooling layer and its preceding convolutional layer in our proposal network are replaced by a new single convolutional layer. To keep the same output dimension, the stride of the max-pooling layer is shifted to the convolutional layer. The parameters of the accelerated model are listed in Table 3.3.
The accelerated network in [56] is fine-tuned from the original network. The learning rate of the new convolutional layers is set to 10 times the learning rate of the other layers for faster convergence. However, since the new convolutional layers can be far from the last layer, convergence can be slow when they are guided by back-propagation from the classification loss only. To overcome this problem, we propose an auxiliary loss function that compares the output of the max-pooling layer of the original network with that of the new convolutional layer of the accelerated network.
The training diagram for the accelerated proposal network is shown in Fig. 3.4. Here, we use a weighted sum of the classification loss and an auxiliary loss in the form of
\[ L = L_{cls} + \lambda L_{conv}, \quad (3.3) \]
where λ is a weighting factor, L_cls is the classification loss, and L_conv is the auxiliary loss. The auxiliary loss is
\[ L_{conv} = \| f_{pool1} - \hat{f}_{conv1} \|_2, \quad (3.4) \]
where f_pool1 is the pool-1 feature of the original network and f̂_conv1 is the conv-1 feature of the accelerated network.
We will compare the performance of fixed and adaptive weight values in Sec. 3.4.5. The rationale for using an adaptive weight value is given below. At the beginning of the training process, it is desired that the new layers converge quickly from their random initialization. Thus, a larger λ is preferred. Once the new conv-1 feature is close to the original pool-1 feature, we should target optimizing the final classification loss, and a smaller λ is proper. Therefore, we can set λ to an exponentially decaying function during the training process.
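As an illustration, a minimal PyTorch-style sketch of the combined loss in Eq. (3.3) with an exponentially decaying λ might look as follows; the initial weight and decay rate are placeholder values, not the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def accelerated_training_loss(logits, labels, f_pool1, f_conv1_hat,
                              step, lam0=1.0, decay=1e-4):
    """Weighted sum of the classification loss and the auxiliary feature loss,
    L = L_cls + lambda * L_conv (Eq. 3.3), with lambda decaying exponentially
    over training steps. lam0 and decay are illustrative values."""
    lam = lam0 * torch.exp(torch.tensor(-decay * step))
    l_cls = F.cross_entropy(logits, labels)
    # Auxiliary loss (Eq. 3.4): distance between the pool-1 feature of the
    # original network and the conv-1 feature of the accelerated network.
    l_conv = torch.norm(f_pool1 - f_conv1_hat, p=2)
    return l_cls + lam * l_conv
```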
3.4 Experiments
3.4.1 Experimental Setup
As stated before, our accelerating proposal module can be combined with any face clas-
sifiers. To train a small model with satisfactory performance, we cascade it with a CNN
to construct the whole face detection pipeline. Specifically, after the proposal module,
we adopt two successive sub-networks that follow the same structure as the RNet (sec-
ond stage) and the ONet (last stage) used in the MTCNN [134]. As a result, we form a
three-stage cascaded lightweight deep face detector.
We evaluate the proposed face detector on two popular benchmarks: the WIDER-
face [125] and the FDDB [41]. The WIDER-face dataset has 393,703 labeled face
bounding boxes from 32,203 images while the FDDB dataset contains 5,171 annotated
faces. To build our training dataset, we use the WIDER-face [125] to extract background and face patches. The WIDER-face dataset consists of about 32,000 images, of which 50% are used for testing, 40% for training and the remaining ones for validation. Furthermore, the eye, nose and mouth patches for training are extracted from the CelebA dataset [66], which has around 200,000 images, most of which contain a single face. We also collected 5,600 ear patches as facial parts due to their robustness to side faces. Besides
public ear datasets such as AMI [27], AWE EAR [20], IIT Delhi [55] and WPUT [23],
we collected 400 extra ear samples from the Internet. We summarize the training data
in Table 3.4.
3.4.2 Evaluation of Model Size
We compare the size of our model with others. The results are listed in Table 3.5, where
* denotes our calculation based on the information from the literature, and the rest is
directly measured. We see from the table that our model is much smaller than complicated CNN frameworks such as DDFD [21], HR [36] and CEDN [111]. It is even smaller than LCDF+ [82], which is the state-of-the-art method with a manually-crafted feature framework. Our model has a size comparable with that of CNN cascade models, e.g., the MTCNN [134] and the nested CNN detector [18].

Table 3.4: Summary of training data used in this work.
Datasets           Images   Type                 Usage
WIDER-face [125]   13k      face detection       face, background
CelebA [66]        202k     landmark detection   eye, nose, mouth
AMI [27]           0.7k     ear recognition      ear
AWE EAR [20]       1k       ear recognition      ear
IIT Delhi [55]     0.5k     ear recognition      ear
WPUT [23]          3k       ear recognition      ear
Own data           0.4k     ear images           ear

Table 3.5: Comparisons of model sizes of several face detectors.
Work            Model size
CEDN [111]      1.1 GB*
DDFD [21]       233 MB*
HR [36]         98.9 MB
LCDF+ [82]      2.33 MB
MTCNN [134]     1.9 MB
Nested [18]     1.6 MB*
Ours            1.96 MB
3.4.3 Evaluation of Face Detection
We first compare the multi-scale capability of our face detector and the MTCNN [134],
which is the state-of-the-art CNN face detection engine. The WIDER-face validation
set [125] is used for evaluation because of its large variety of face scales. Since this experiment targets at evaluating multi-scale detectability, we set different levels of scaling factor for the image pyramid. For fair comparison, we use the model provided by [134] and follow the same parameter settings. The results are listed in Table 3.6.

Table 3.6: Comparison of detection performance with the MTCNN [134] on the WIDER-face validation set [125] with different scale factors.
                 Scale factor:  0.79    0.50    0.25
Easy     MTCNN [134]            0.836   0.817   0.755
         Ours                   0.844   0.842   0.826
Medium   MTCNN [134]            0.809   0.798   0.744
         Ours                   0.809   0.805   0.794
Hard     MTCNN [134]            0.622   0.600   0.529
         Ours                   0.603   0.568   0.519
We see from the results that both face detectors achieve satisfactory accuracy with
the dense image pyramid at the scale factor of 0.79. This scale is also chosen by the
MTCNN. Our detector outperforms the MTCNN on the Easy set while the MTCNN
performs better on the Hard set. It is worthwhile to emphasize that the MTCNN utilizes
joint training for face detection and facial landmark localization. The latter is not used
in our detector training.
As the image pyramid becomes sparse, MTCNN’s accuracy drops rapidly. When the
scale factor decreases from 0.79 to 0.25, its accuracy degrades by 8.1%, 6.5% and 9.3%
on the easy, medium and hard sets, respectively. In contrast, the accuracy of our method
without model acceleration drops by 1.8%, 1.5%, 8.4%, respectively.
To verify the benefit of using extra ear data for training, we compare the performance
of the model with and without extra ear training data on the WIDER-face validation set.
Without extra ear training data, the detection performance drops by 0.01, 0.012 and
0.046 on the easy, medium and hard sets, respectively.
Then, we conduct face detection experiments on the FDDB [41]. We use 0.25 as the
pyramid scaling factor and add an extra layer to the image pyramid with half of the size
of the largest scale.

Figure 3.5: Evaluation results on FDDB (ROC curves of true positive rate versus the number of false positives for MTCNN, Ours, CEDN, Nested 48, Nested 36, HR, DDFD, CascadeCNN, ACF-multiscale, HeadHunter, Joint Cascade, Boosted Exemplar, Viola-Jones and Pico).

As illustrated in Fig. 3.5, our method outperforms many others such
as the CEDN [111] and the nested CNN detector [18]. In terms of detection accuracy,
our model can achieve 94.35%. Compared to the 83.29% obtained by the nested CNN detector, this is an improvement of more than 11 percentage points. It also achieves accuracy comparable to that of the MTCNN [134], which uses more pyramid levels. The HR [36] outperforms
our model by a small margin, yet its model size is too large to be deployed on mobile
devices.
3.4.4 Evaluation of Runtime Efficiency
We first compare the detection speed with the MTCNN method [134] using its provided Matlab code. For fair comparison, our method is also implemented in Matlab. The experiment was conducted on the WIDER-face validation set [125] using a GeForce GTX TITAN GPU. We use the original images without re-sampling them to a fixed resolution. The runtime is calculated by averaging the time over the entire validation set. For both detectors, the minimum face size to detect is set to 10, as used by the MTCNN. The scaling factor of the MTCNN is 0.79 as given in its original setting, while our detector uses a scaling factor of 0.25 plus an extra pyramid layer, the setting with comparable accuracy reported in Sec. 3.4.3. For some images in the WIDER-face validation set, the number of proposals produced by the MTCNN is more than our GPU memory (12 GiB) can take. Therefore, we keep at most 20,000 proposals per image for the MTCNN, which in fact reduces its average runtime. The average runtime of the MTCNN in this case is 0.595 s while that of our detector is 0.499 s. We reach more than 16% acceleration. Clearly, our detector can achieve similar accuracy with a faster speed.
The running time claimed by the nested CNN detector [18] is 40.1 ms using the CPU only, for a 640×480 VGA image with 80×80 as the minimum face size. For comparison, we follow the same settings for the resolution and the minimum face size. The data used for runtime evaluation were not mentioned in [18]. Here, we evaluate both detection accuracy and running time on FDDB [41], the dataset used in [18], for performance benchmarking. With model acceleration, our model achieves 39.1 ms compared to the 40.1 ms of the nested CNN detector. This shows that we still obtain a faster speed together with the significant accuracy improvement indicated in Sec. 3.4.3.
Furthermore, we implemented our method on a Samsung Galaxy S8 using Caffe. The face detector continuously received high-resolution images (1280×720) from the back camera. By setting the minimum face size to 100 and the scaling factor to 0.25 with the extra pyramid layer as in the previous experiments, the detection speed varies from 8 to 10 FPS in different scenarios on the mobile CPU.
Figure 3.6: Comparison of convergence curves, where the x-axis is the number of iterations (up to 4×10^5). The three curves correspond to training with L_cls only, with a constant λ, and with an exponentially decaying λ.
3.4.5 Evaluation of Training Acceleration
We compare the learning curves of three different training schemes in Fig. 3.6. The first one is the original scheme proposed in [56], where only the classification loss is used. The other two adopt our proposed training scheme with the auxiliary loss function given in Eq. (3.3); one uses a constant λ while the other uses an exponentially decaying λ. For all three settings, a learning rate of 0.008 is used. We observe from these curves that the auxiliary loss function helps improve the convergence speed, especially at the beginning. This also verifies our intuition that relying on back-propagation from the final classification loss alone results in slow convergence.
3.5 Conclusion
An efficient face detector was proposed in this chapter. The new scheme can generate face proposals quickly by capturing both global and local facial cues, which reduces the number of image pyramid levels. Furthermore, a method was introduced to infer face locations from local facial characteristics. The proposed methods were validated on two popular benchmarking datasets. The potential of our approach for real deployment on mobile devices was clearly demonstrated by the experimental results. As compared with state-of-the-art methods, our scheme provides a more attractive solution by considering accuracy, model size and detection speed jointly.
Chapter 4
Fashion Analysis Using Local Region
Characteristics
4.1 Introduction
Fashion plays an important role in modern society. Potential applications of fashion study in apparel, such as personalized recommendation and virtual wardrobes, are of great financial and cultural interest. With the rapid advancement of computer vision and machine learning, automated visual analysis of fashion has attracted significant research interest in recent years.
Recent work on fashion image analysis relies on well-annotated data [60, 61, 65,
67, 39]. However, manual annotation of fashion data is more complicated than other
data types due to its ambiguous nature. The boundary between fashion concepts can be
subjective. As a result, annotations from multiple labelers are less consistent than other
annotation tasks. For example, the same item labeled as a “jacket” by one person may
be labeled as a “coat” by another. Both can be acceptable given the subtle difference
between these two categories. Instead of spending substantial resources on manually annotating fashion data, some studies focus on utilizing online resources for fashion research without human annotations [121, 91, 90]. These online resources often provide rich but inaccu-
rate (or noisy) metadata [91]. How to take advantage of such online resources remains
an open problem.
To address this challenge, we propose a feature extraction method that captures the
characteristics of local regions in weakly annotated fashion images. It consists of two
stages. In the first stage, we aim at detecting three types regions in a fashion image:
top clothes (t), bottom clothes (b) and one-pieces (o). To achieve better clothing item
detection accuracy, the person region is also identified to provide a global view. In the
second stage, we extract discriminative features from detected regions. To obtain repre-
sentative features, we collected a fashion dataset, called Web Attributes, which provides
detailed information in detected regions. Furthermore, we present a method to collect
fashion images from online resources and conduct automatic annotation. It is shown
by experiments that extracted features can capture the local characteristics of fashion
images well and offer the state-of-the-art performance in fashion feature extraction.
The contributions of this work are threefold: 1) proposing a two-stage feature extrac-
tion method to generate a separate descriptor for each fashion item in the input; 2) collecting a fashion dataset for feature extractor training; and 3) presenting an automatic fashion data collection and annotation method using online resources. The rest of this chapter is organized as follows. Challenges in fashion image analysis are highlighted in
Sec. 4.2. The region-based feature extraction method and data collection are explained
in Sec. 4.3. Experimental results are shown in Sec. 4.4. Finally, concluding remarks are
given in Sec. 4.5.
4.2 Review of Related Work
Fashion research covers various specialized tasks. Clothing parsing [60, 61, 121, 122]
aims at segmenting individual fashion items in images. Fashion landmark localization
[65, 67] targets at detecting virtual landmark points on fashion items defined by humans.
Clothing retrieval [65, 39] involves searching for the same fashion item in different images under different circumstances (e.g., lighting conditions, viewing angles, deformation, etc.). Clothing attribute classification [65, 9, 19] focuses on the characteristics of each fashion item. Clothing style classification [47] deals with the appearance style of indi-
vidual persons. Fashion description [91] extracts features from fashion items, which
can be utilized in solving multiple fashion problems. There exists previous work that
utilized online resources for fashion study without human annotations [121, 91, 90].
Fashion images in PaperDoll [121] were collected based on associated meta-data tags
that denote attributes such as color, clothing item or occasion. For each input image,
similar images were retrieved from the collected database using hand-crafted features.
The prediction was then made by voting from the labels of retrieved similar images to
get a more robust result. Simo-Serra et al. [90] proposed a method to learn and predict
how fashionable a person appears in a photo. To achieve this, a heterogeneous dataset
called the “Fashion144k” was collected automatically online. The dataset contains 11
information types such as the number of fans, location, post tags, etc. Furthermore, they
proposed a conditional random field model based on all information types. However,
the contribution of each information type and the impact of inaccurate tags are unclear.
The impact of noisy labels in the Fashion144k dataset was investigated in [91], where a feature extraction network was proposed. It was observed that fashion data deviate too much from general object or scene datasets, which are therefore not directly helpful for fashion learning. Consequently, the authors sought a better way to utilize the noisy fashion dataset. To suppress noise in each single label, they first filtered out images with fewer than three labels. Then, they constructed triplets of images that contain one reference image, one similar image and one dissimilar image based on the noisy labels. The network was trained jointly on noisy and cleaned labels. The resulting features outperformed prior work, including large-scale networks trained on ImageNet
[93]. Their feature extraction network, however, has two main drawbacks. First, the training only relies on color- and category-related labels and, as a result, the obtained features cannot represent detailed characteristics of fashion items. Second, the network only captures the global characteristics of an image. It cannot separate each individual fashion item (say, top and bottom) and extract features correspondingly.

Figure 4.1: The pipeline of the proposed fashion feature extraction system.
4.3 Proposed Method
As shown in Fig. 4.1, the proposed fashion feature extraction method consists of two
stages: 1) representative region detection and 2) local feature extraction. They are
detailed in Secs. 4.3.1 and 4.3.2, respectively.
4.3.1 Representative Region Detection
In the first stage, we aim at detecting the regions of clothing items in a fashion image.
The regions include the top (t), bottom (b) and one-piece (o) clothes. The person (p)
region is also identified to improve clothing item detection accuracy. We adopt a pop-
ular object detection CNN called the Fast RCNN [25] for region detection. The Fast
RCNN detects generic objects and relies on object proposals from Selective Search
(SS) [105]. However, because SS proposals are designed for generic objects, the stan-
dard Fast RCNN is not effective in clothing item detection. To make the Fast RCNN
more suitable for our specific task, we customize it in the following two areas. First, we
replace proposals obtained by SS with uniformly distributed fixed-size bounding boxes,
which are referred to as “anchor proposals”. Second, we add an auxiliary classifier with
the person region as the input for better detection performance. The architecture of the
customized Fast RCNN is depicted in Fig. 4.2.
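To illustrate the anchor-proposal idea, a small sketch that tiles an image with uniformly distributed, fixed-size boxes is given below; the box sizes and stride are illustrative assumptions rather than the values used in this work.

```python
from itertools import product

def anchor_proposals(img_w, img_h, box_sizes=((128, 128), (192, 256)), stride=32):
    """Generate uniformly distributed, fixed-size anchor boxes (x1, y1, x2, y2)
    covering the whole image. Sizes and stride are illustrative placeholders."""
    boxes = []
    for (bw, bh), x1, y1 in product(box_sizes,
                                    range(0, img_w, stride),
                                    range(0, img_h, stride)):
        x2, y2 = x1 + bw, y1 + bh
        if x2 <= img_w and y2 <= img_h:
            boxes.append((x1, y1, x2, y2))
    return boxes

# Example: anchors for a 600x800 fashion image.
print(len(anchor_proposals(600, 800)))
```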
Figure 4.2: The architecture of the customized Fast RCNN. (Anchor proposals pass through the convolutional layers and an RoI pooling layer into two FC layers, followed by the box classifier and the box regressor; a second RoI pooling branch on the person box with the highest score feeds the global classifier.)
As shown in Fig. 4.2, the RoI pooling layer takes in the anchor proposals and gener-
ates an RoI (Region of Interest) feature vector, which is input to the following two fully
connected (FC) layers. On top of the two FCs are a region classifier and a bounding
box regressor, which provide the category of the RoI and the bounding box coordinate
offsets, respectively. The training loss [25] consists of the RoI classification loss, L_Cls, and the RoI bounding box regression loss, L_Box, in the form of
\[ L_{Cls} = -\frac{1}{|B|} \sum_{b \in B} \sum_{c \in C} y^c_{FR}(b) \ln p^c_{FR}(b), \quad (4.1) \]
\[ L_{Box} = \frac{1}{|B^+|} \sum_{b \in B^+} \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i, b_i), \quad (4.2) \]
where B is the anchor proposal set and C is the category set; y^c_FR(b) and p^c_FR(b) are the ground truth and the predicted confidence that bounding box b contains a fashion item from category c, respectively; B^+ is the positive proposal set, where each proposal overlaps sufficiently with at least one fashion item or a person; and smooth_{L_1}(t_i, b_i) measures the box coordinate difference between ground truths and positive proposals. We refer to the Fast RCNN [25] for further details.
The category set, C, in the customized Fast RCNN includes the top (t), bottom (b) and one-piece (o) clothes as well as the person (p). The person bounding box provides a global view and reduces the errors from RoIs that focus only on local regions. For example, when a person wears a dress with a belt, the detector may generate two separate bounding boxes with high scores for the top and the bottom clothes, as the RoIs only cover the upper or lower body. To suppress such errors, we train an auxiliary classifier, referred to as the "global classifier" in Fig. 4.2. The global classifier has a multilayer perceptron structure, and it is trained to predict the existence of the three clothing types based on the person box. More specifically, during the training process, we identify the boxes, denoted by B_p, from the Fast RCNN (as shown in the top branch of Fig. 4.2) that overlap sufficiently with the person ground truth box. The RoI features of this group of bounding boxes are fed into the global classifier to predict the existence of clothing items for the particular person. We use the following multi-class cross-entropy as the training loss
\[ L_G = -\frac{1}{|B_p|} \sum_{b \in B_p} \sum_{c \in C} \big[ y^c_G(b) \ln p^c_G(b) + (1 - y^c_G(b)) \ln(1 - p^c_G(b)) \big], \quad (4.3) \]
where p^c_G(b) is the existence probability from the global classifier for class c in box b, and y^c_G(b) ∈ {0, 1} indicates the ground truth existence in b. The total training loss, L, is the summation of the Fast RCNN training loss and the global classifier loss:
\[ L = L_{FR} + L_G = L_{Cls} + L_{Box} + L_G. \quad (4.4) \]
To conduct inference, we take the person box with the highest confidence, b_p, and filter out cloth boxes that do not overlap with b_p. Among the remaining boxes, the scores are weighted by the global score, p^c_G(b_p). Namely, the final scores are
\[ p^c(b) = p^c_{FR}(b)\, p^c_G(b_p). \quad (4.5) \]
Boxes with scores higher than a threshold are preserved for further steps.
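To illustrate the inference step, a minimal sketch of the overlap filtering and score weighting of Eq. (4.5) is given below; the data structures and the threshold value are illustrative assumptions.

```python
def overlaps(a, b):
    # True if boxes (x1, y1, x2, y2) intersect.
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])

def fuse_scores(cloth_boxes, person_box, global_scores, score_thresh=0.5):
    """Keep cloth boxes overlapping the best person box and weight their Fast RCNN
    scores by the global classifier score (Eq. 4.5).
    Each cloth box is a dict: {"box": (x1, y1, x2, y2), "cls": c, "p_fr": float};
    global_scores maps each class c to p^c_G(b_p)."""
    kept = []
    for det in cloth_boxes:
        if not overlaps(det["box"], person_box):
            continue
        score = det["p_fr"] * global_scores[det["cls"]]   # p^c(b) = p^c_FR(b) * p^c_G(b_p)
        if score > score_thresh:
            kept.append({**det, "score": score})
    return kept
```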
4.3.2 Local Feature Extraction
In the second stage, we extract discriminative features from detected clothing regions.
The local feature extraction scheme is illustrated in Fig. 4.3. It is well-known that
shallower convolutional layers capture low-level image details while deeper ones cap-
ture more structured information with high-level semantics [61]. Following this line of
thought, we adopt a multi-scale feature extraction approach that combines features from both low-level and high-level convolutional layers. More specifically, the low-level feature and the high-level feature are concatenated into a multi-scale feature vector.
Figure 4.3: The local feature extraction module.
We use two types of loss functions to train the feature extractors. First, we adopt the triplet loss, L_triplet [34], to train the features directly. It is used to enforce smaller feature distances between reference and similar inputs and larger feature distances between reference and dissimilar inputs. We select reference, similar and dissimilar inputs to form a triplet based on their labels. If an image has more than θ_sim labels in common with the reference image, it is selected as a similar input. On the contrary, if it has fewer than θ_dissim labels in common with the reference, the image becomes a dissimilar input.
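As a concrete illustration of this triplet selection rule, consider the following small sketch; the threshold values and the representation of labels as sets are assumptions for illustration only.

```python
import random

def build_triplets(labels, theta_sim=3, theta_dissim=1, max_attempts=100000):
    """labels: dict mapping image id -> set of (noisy) attribute labels.
    Returns (reference, similar, dissimilar) id triplets based on label overlap."""
    ids = list(labels)
    triplets = []
    for _ in range(max_attempts):
        ref, cand_s, cand_d = random.sample(ids, 3)
        if (len(labels[ref] & labels[cand_s]) > theta_sim and      # similar input
                len(labels[ref] & labels[cand_d]) < theta_dissim): # dissimilar input
            triplets.append((ref, cand_s, cand_d))
    return triplets
```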
Next, we add several output layers as attribute classifiers on top of the feature layer. Each output layer corresponds to an attribute group (e.g., style, neckline, etc.) and its output dimension is determined by the number of classes in that attribute group. Depending on the property of the attributes, we divide them into part attributes (such as neckline) and global attributes (such as style). The former take the low-level features as input while the latter take the high-level features. To train those classifiers, we use the softmax cross-entropy loss, denoted by L_{cls,i}, for the i-th attribute group.
The total loss is a weighted summation of the two losses, in the form of
\[ L = \frac{1}{K} \sum_{i=1}^{K} L_{cls,i} + \lambda L_{triplet}, \quad (4.6) \]
where K is the number of attribute classifiers and λ is the weight for the triplet loss.
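A minimal PyTorch-style sketch of the total loss in Eq. (4.6) might look as follows; the triplet margin and the weight λ are illustrative values, and the per-group attribute logits are assumed to come from the output layers described above.

```python
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.2)   # margin is an illustrative choice

def total_loss(attr_logits, attr_targets, anchor, positive, negative, lam=1.0):
    """Eq. (4.6): average of the K attribute classification losses plus a
    weighted triplet loss on the (reference, similar, dissimilar) features."""
    k = len(attr_logits)
    l_cls = sum(cross_entropy(logits, target)
                for logits, target in zip(attr_logits, attr_targets)) / k
    return l_cls + lam * triplet(anchor, positive, negative)
```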
4.3.3 Data Collection and Web Attributes Dataset
Table 4.1: Comparison of fashion datasets without human annotation.
Datasets           # images   High-level annotation     Low-level annotation
PaperDoll [121]    339,797    Category                  -
Fashion144k [90]   144,169    Category, Color, Style*   -
Web Attributes     12,594     Style, Shape, Category    Pattern, Neckline, Placket, Sleeve, Decoration
* Obtained by running a style estimator [46]
We collected our fashion data from two online fashion retailer websites: shein.com and forever21.com. The SheIn website has a limited number of fashion images, from which we collected only 3,180 images. The Forever21 website has more fashion images, from which we collected 9,414 images.
Table 4.2: Comparison of accuracy of the attribute classification tasks.
Features                   Button   Length   Pattern   Shape    Collar shape   Sleeve length   Sleeve shape   Mean accuracy
Stylenet [91]              0.4085   0.4718   0.6046    0.4743   0.1855         0.7551          0.6137         0.5019
(a) Stylenet + FT          0.3929   0.4726   0.5845    0.4537   0.1845         0.7615          0.6507         0.5001
(b) Stylenet + FT + LFE    0.4092   0.5241   0.6624    0.4922   0.2820         0.7825          0.6758         0.5469
(c) ResNet-50              0.2769   0.3981   0.5137    0.3469   0.1093         0.7527          0.5950         0.4275
(d) ResNet-50 + FT + LFE   0.3701   0.5048   0.6624    0.4935   0.2820         0.7825          0.6758         0.5387
(e) Fusion of (a)+(d)      0.4301   0.5574   0.6624    0.5478   0.2820         0.7827          0.6758         0.5626
FT and LFE denote the "fine-tuned on the Web Attributes dataset" and the "local feature extraction", respectively.
Both websites provide descriptions in different formats for their fashion images. Descriptions on the SheIn website consist of lists of keywords while those on the Forever21 website are long sentences. We took the keywords in SheIn's descriptions as references and used them to label each image from Forever21. Similar keywords such as "V-neck" and "V-neckline" are combined using the similarity score given by WordNet [77]. Then, we searched Forever21's descriptive sentences for the reference keywords from the SheIn website and their synonyms. These matched keywords became labels of the corresponding images. We also analyzed the part of speech of each keyword to reduce ambiguity between words. Afterwards, we only kept labels that frequently appear in the collected descriptions. In the end, we have 159 types of attribute labels. They can be coarsely divided into eight label groups: 1) pattern, 2) neckline, 3) style, 4) placket, 5) decoration, 6) sleeve, 7) shape and 8) category.
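To illustrate the keyword-merging step, a small sketch using NLTK's WordNet interface is given below; the similarity threshold and the example words are illustrative assumptions rather than the exact procedure used to build the dataset.

```python
# Requires the WordNet corpus: run nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def similar_keywords(word_a, word_b, threshold=0.85):
    """Treat two keywords as the same label if any pair of their WordNet
    synsets has a Wu-Palmer similarity above the threshold."""
    for sa in wn.synsets(word_a):
        for sb in wn.synsets(word_b):
            sim = sa.wup_similarity(sb)
            if sim is not None and sim > threshold:
                return True
    return False

# Example: decide whether two description keywords should be merged.
print(similar_keywords("neckline", "collar"))
```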
Based on the aforementioned automatic data labeling procedure, we created a new
fashion dataset called Web Attributes. We compare several fashion datasets with little
human annotation in data labeling in Table 4.1. Some exemplary images from the Web
Attributes dataset are shown in Fig. 4.4.
4.4 Experiments
Data and implementation details. The customized Fast RCNN used the VGG-16[93]
as the backbone and was pre-trained on the ImageNet [17]. We fine-tuned it with the
Clothing Parsing [60] dataset, with the segmentation annotations converted to bounding
boxes. The images in the Clothing Parsing dataset are randomly partitioned into train,
val and test sets, with 4,622, 1,451 and 1,450 images, respectively.
For fashion item feature extraction, we chose two networks in the experiments –
ResNet-50 [31] and Stylenet [91]. The ResNet-50 was first pre-trained from scratch using Fashion144k's images with noisy category labels. For the Stylenet, we adopted
the pre-trained model from [91]. Then, we applied the proposed system to these two
networks and fine-tuned them on the Web Attributes dataset.
Being similar to [91], we evaluated the performance of the proposed feature extractor
on an external fashion dataset. The work in [91] demonstrated its performance on the
Hipster Wars dataset, which has less than one thousand valid images in total. Here, we
considered the large-scale DARN dataset [39]. Although some of the images are no
longer available, we still obtained 158,088 images from its online subset. Each image in
the subset has associated attribute labels such as button, length, shape, etc. These images
were partitioned into training, validation and testing subsets with a ratio of 5:1:4.
Detection. The evaluation of the customized Fast RCNN is conducted on the test split
of the Clothing Parsing dataset. Since images in this dataset have at most one clothing
item annotated in t, b, o, we first select the best-scored bounding box for each category
and abandon those with scores lower than θ_det = 0.5. Then, the remaining boxes are compared with the ground truth (GT) boxes. A detected box is a true positive (TP) if and only if it has an intersection over union (IoU) larger than θ_IoU = 0.5 with a GT box and its predicted category is correct. The results are shown in Table 4.3. We see that the detection results for the person category are much better than the others. This could be attributed to the use of the ImageNet pre-trained model: numerous person images had been used in pre-training, and many of them are harder than the Clothing Parsing dataset in terms of person detection.

Figure 4.4: Exemplary images of the Web Attributes fashion dataset.

Table 4.3: The performance of the customized Fast RCNN on the Clothing Parsing dataset with the test split.
Category    Person   Top     Bottom   One-piece
Recall      0.984    0.887   0.745    0.648
Precision   0.984    0.832   0.705    0.724
F-measure   0.984    0.859   0.591    0.634

Figure 4.5: Comparison of the top-5 retrieval results between the Stylenet [91] and the proposed feature extractor.
Classification. The classification experiment performed on the DARN dataset is sim-
ilar to that in [91] performed on the Hipster Wars dataset. We extracted features for
each image. Then, we trained logistic regression classifiers for the attribute prediction tasks using the training subset, tuned hyper-parameters on the validation subset and, finally, checked the performance on the testing subset. The testing results are listed in Table 4.2, where FT and LFE stand for "fine-tuned on the Web Attributes dataset" and
“local feature extraction”, respectively. By comparing results of Stylenet [91] in the
top row and method (a), we see that simply fine-tuning the network on the collected
dataset does not help much, since detailed characteristics of local regions still cannot
be captured. The results of method (b), which was trained with local feature extraction,
outperform the previous two methods. This demonstrates the importance of features in
local representative regions. The comparison between methods (c) and (d) also verifies
this point. The final fusion method, which integrates the characteristics in both global
and local regions, gives the best performance. It has 6% gain over the Stylenet in the
mean accuracy measure as shown in the last column in Table 4.2.
Retrieval. We also conducted a retrieval experiment to compare our proposed feature
extractor with the Stylenet [91]. We used the DARN testing subset as the retrieval database and compared the features using the Euclidean distance. Fig. 4.5 gives the top-5 retrieval results. We see from the first and the second rows that the proposed feature extractor captures important patterns and shapes of the query striped dress while the Stylenet fails. Comparison of the third and the fourth rows illustrates that the pro-
posed feature extractor focuses more on the category and the shape of clothes while the
Stylenet retrieves different cloth types of the same color.
4.5 Conclusion
We proposed a two-stage method for representative region feature extraction from fash-
ion images by leveraging weakly annotated web images. A new fashion dataset, called Web Attributes, was constructed from online resources with automatic annotation. By providing both global and local region characteristics, our feature extraction method can zoom into local regions that cannot be easily captured otherwise. Its superior performance
in representative region detection, attribute classification, and similar cloth retrieval was
demonstrated by experimental results.
Chapter 5
Fashion Outfit Compatibility Learning
with Global and Local Supervisions
5.1 Introduction
Learning color compatibility in fashion outfits helps us identify what colors are com-
patible in the fashion industry and how to choose compatible colors for an outfit. While
many factors contribute to the compatibility of fashion outfits, the compatibility of color
plays a pivotal role. When verifying whether a set of fashion items form a compatible
outfit, the first check that comes to one’s mind is whether their colors are compatible or
not. Unlike silhouettes, for which each human body may have a different match, color compatibility is more universal and applicable to a wider audience. Therefore, learning color compatibility for fashion will help us better identify compatible outfits and provide recommendations with maximum color synergy.
Previous color compatibility studies are more of theoretical interest rather than based
on real world data, and do not necessarily apply to fashion problems. The harmonic
color templates developed by Matsuda [75, 104] were later adopted for natural image
color harmonization in [11, 100]. Nevertheless, the authors of [81] noticed a wide gap
between the theories and the large color datasets collected from human users. Moreover,
those methods for natural image color harmonization do not specifically apply to the fashion domain. Besides, fashion trends are notorious for changing. Therefore, explicitly learning color compatibility for fashion is necessary, and we would like to implement it in a data-
driven manner.
Despite the importance of colors to fashion, using color information to help identify
fashion outfit compatibility is rare. Most previous studies either ignore the importance
of color compatibility or implicitly embed colors together with other information. Only recently, as people look into interpretable fashion compatibility, has color information started to attract attention. In [102], dominant colors are extracted from each image.
An influence score is predicted for each color to indicate whether this color contributes
to the compatibility of the corresponding outfit or not. Furthermore, in [126], color is
used as one of the rich attributes that help interpret fashion compatibility. These studies
only focus on the impact of each individual color but ignore the compatibility of colors.
In our work, we argue that a graph is a good representation for an outfit. Most previous work focuses on learning the pairwise compatibility between two items rather than the compatibility of entire outfits [106, 127, 12, 126, 64]. In [29], the compatibility of entire outfits is learned by a Recurrent Neural Network (RNN). However, it requires a predefined order to convert a set of items into a sequence. Recently, Cui et al. [13] proposed to use undirected graphs to model outfits, where each node in the graph represents an item. A graph represents the entire set of items without an artificial order and naturally
models the relation among fashion items.
We also model outfits with graphs but propose a novel graph construction method
that is more suitable for compatibility learning. Most existing graph convolutional net-
work approaches focus on node information aggregation. However, in the compatibility
problem one should focus on the relation among items, which is represented by the edge
information in the existing graph construction [13]. Instead of using nodes to represent
fashion items, we model each pairwise relation between items as a node to utilize the
full power of graph convolutional network methods.
To interpret the color compatibility patterns, we further propose a joint training
scheme to learn compatibility prediction as well as outfit clusters. Although outfit com-
patibility is available in the data, it is still challenging to cluster the outfits such that the
color compatibility is more interpretable. Our joint training scheme yields meaningful outfit clusters, from which the color pattern of each cluster can be interpreted. The obtained clusters further help the network learn compatibility better, as demonstrated in the experimental results.
To summarize, our contributions are three-fold:
1. We specifically learn color compatibility for fashion outfits in a data-driven man-
ner and reformulate it as a graph learning problem. The proposed compatibility
learning method is also applicable to learning compatibility from other fashion
representations including deep image features.
2. We present a new graph construction method that is more suitable for learning the
compatibility of a set of fashion items.
3. We propose a joint training scheme to learn compatibility prediction and outfit
clustering, which helps the interpretation of the prediction as well as improves the
prediction performance.
5.2 Related Work
5.2.1 Color Compatibility
Color template theories were widely adopted for color compatibility problems. The
set of hue templates proposed by Matsuda in [75] and other variants are frequently
used by later work [104, 11, 100]. However, experiments on large-scale color theme
datasets [81] demonstrate a large distance between the hue templates and real-world
data.

Figure 5.1: An overview of our proposed method for learning color compatibility in fashion outfits. First, color palettes are extracted from each fashion item image as the fashion feature. Then a graph is constructed such that each node represents the pairwise relation between two fashion items. Afterwards, the constructed graph together with the extracted fashion features is embedded into a single vector embedding, which is used for the final compatibility prediction. The joint training scheme for compatibility prediction and outfit clustering helps improve the prediction performance and interpretability.

Instead, O'Donovan [81] proposed to learn a regression model for color compat-
ibility from color theme data. Each color theme consists of five colors collected from
human users. This color compatibility regression model is further adopted in [131] to
generate compatible outfits. The limitation of this model is that it only evaluates sets
of five colors. Furthermore, those color themes are for general purposes and are not
specifically related to fashion.
5.2.2 Fashion Outfit Compatibility
Many previous works in fashion outfit compatibility learning first embedded each fashion item, then used the embedding distances as the measure of pairwise compatibility [107, 8, 106, 45, 10, 129]. In [127], a generalization of this concept was proposed. Another line of work directly learned a model for a specific fashion task [103, 59, 12, 71, 84, 13]; such models are hard to interpret and to transfer to other tasks.
Some recent work tended to add interpretability into models via disentanglement. In
[22], each item embedding was disentangled using extra attribute labels. A score was
assigned to each attribute as interpretation. In [102], each image was disentangled into
color and shape/texture features, then the influence of each feature type was studied.
Tree models were used in [126] to derive decision rules for pairwise item compatibility.
Item embeddings were generated in [101], which utilized multiple learnable similarity
condition masks. A comment generator was proposed in [64], which required extra user
comments for training.
Most approaches above only modelled pairwise compatibility between fashion
items, whereas the compatibility of an entire outfit is either not considered or predicted as a simple average of the pairwise compatibilities. Besides the work mentioned above,
[40, 95] also fall into this category.
To model the compatibility of entire outfits, RNNs were utilized in [29, 79]. How-
ever, when a set of fashion items are modelled as a sequence, a pre-defined order is
required. In [13], an outfit was modelled as a graph, which avoided the explicit pairwise
compatibility modelling and virtual orders of the items. We also model an outfit as a
graph but the graph construction is different.
5.2.3 Graph Convolutional Networks
Unlike convolution on a grid, which is straightforward, convolution over graphs has been studied and has evolved over the years [6, 32, 16, 50, 28]. One drawback of most existing
graph convolutional networks is that they only utilized the node features. In their work,
edge features were either ignored or only 1-D edge weights were used.
A few attempts were made to overcome this drawback. In [92], an edge-conditioned
convolution was proposed where edge information was utilized to filter the nodes. More-
over, edge features were directly utilized for node feature aggregation functions in [26].
In these works, edge features are only considered as auxiliary information for node feature aggregation and do not get updated during network inference. Monti et al. [78]
proposed a Dual-Primal GCN framework, which alternated between dual and primal
convolutional layers. The dual convolutional layer applied graph attention on the dual
graph to produce features on the edges of the primal graph, and the primal edge features
were used in the primal convolutional layer to compute attention scores for producing
primal node features. In [48], 2-D edge features defined as (dis)similarities between
pairs of nodes were updated following each update of nodes. Five options for updating
edge features were discussed in [112], whereas all of them were based on node fea-
ture updates. For fashion compatibility, however, direct aggregation and update of edge
features are desired.
5.3 Learning Color Compatibility in Outfits
An overview of our color compatibility prediction system is shown in Figure 5.1. It
consists of three modules: color feature extraction which we will discuss in Section
5.3.1, graph construction and embedding which we will discuss in Section 5.3.2, and
joint compatibility prediction and outfit clustering, which we will discuss in Section 5.3.3.
Although we use color information as fashion item representations for compatibility
learning, our proposed framework can be widely applied to other representations as
well. In Section 5.4, we demonstrate the effectiveness of our framework with other fashion item representations.
5.3.1 Color Feature Extraction
We propose to use the color palette, a collection of representative colors, as the color
representation for an item. Previous work [120, 33, 37] on fashion usually uses color histograms as features. Here we instead use the color palette for several reasons: First, our goal is to study the compatibility among colors, rather than the statistics of the color distribution; Second, illumination variation, which can be viewed as noise, is encoded in the color histogram, whereas the color palette is more invariant to illumination changes; Third, color palettes are more abstract, so compatibility patterns can be learned from them more easily.
To extract color palettes, we first convert the fashion image pixels from the RGB color space to Lab, where L is the lightness, a indicates the degree between green and red, and b indicates the degree between blue and yellow. In the Lab color space, the distance between two colors reflects their visual difference as perceived by humans. We use the K-means clustering algorithm with Euclidean distances to cluster the pixels of each fashion item. The cluster centroids form the palette of the corresponding fashion item.
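A minimal sketch of this palette-extraction step is given below, assuming scikit-image and scikit-learn are available; the image and mask variables are illustrative inputs, not the exact implementation used in our experiments.

```python
# Minimal sketch of the palette-extraction step described above (not the exact
# implementation): convert an RGB item image to Lab, keep foreground pixels,
# and take the K-means centroids as the palette.
import numpy as np
from skimage.color import rgb2lab
from sklearn.cluster import KMeans

def extract_palette(rgb_image, foreground_mask, num_colors=3):
    """rgb_image: (H, W, 3) array; foreground_mask: (H, W) boolean array."""
    lab = rgb2lab(rgb_image)               # RGB -> Lab conversion
    pixels = lab[foreground_mask]          # keep only the fashion-item pixels
    kmeans = KMeans(n_clusters=num_colors, n_init=10, random_state=0).fit(pixels)
    return kmeans.cluster_centers_         # (num_colors, 3) Lab palette
```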
5.3.2 Graph Construction and Embedding
We propose a novel graph construction method for outfits. Since the outfit compatibility
is the global relation among outfit items, we aim to learn a global outfit relation by
aggregating local pairwise relations between items. For a given outfit of $N$ items, we construct a graph $G=(V,E)$ with $\binom{N}{2}$ nodes, where each vertex represents the pairwise relation between two fashion items in the outfit. For each pair of nodes that represent relations to a common item, we connect them with an edge. An illustration is shown in Figure 5.2.

Figure 5.2: (a) The graph construction step, where each pairwise item relation is modelled as a node. The node features are obtained by embedding the two item features. (b) The graph embedding step with a GCN.

Compared to the graph construction method in [13], our proposed
graph can be viewed as its line graph (also known as the edge-to-node dual graph).
For the node $V_{i,j} \in V$ that represents the relation between the $i$th and the $j$th items, the node feature is obtained from the features of these two items as
$$f_{V_{i,j}} = E(f_i) \odot E(f_j), \qquad (5.1)$$
where $E(\cdot)$ is an embedding module, $\odot$ denotes the Hadamard product, and $f$ denotes the feature.
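A sketch of this construction under these definitions is given below, assuming PyTorch; the helper name build_relation_graph and its inputs are illustrative.

```python
# Sketch of the pairwise-relation graph: one node per item pair, node features
# as in Eq. (5.1), and an edge between two pair-nodes that share a common item.
from itertools import combinations
import torch

def build_relation_graph(item_features, embed):
    """item_features: (N, d) tensor; embed: the embedding module E(.)."""
    n = item_features.size(0)
    pairs = list(combinations(range(n), 2))               # one node per item pair
    emb = embed(item_features)                             # E(f_i) for every item
    node_feats = torch.stack([emb[i] * emb[j] for i, j in pairs])  # Hadamard product
    edges = [(a, b)                                        # connect pairs sharing an item
             for a in range(len(pairs)) for b in range(a + 1, len(pairs))
             if set(pairs[a]) & set(pairs[b])]
    return node_feats, edges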
We can apply various GCN models to aggregate the node information and embed the graph. In this work, we use GraphSage [28]. At the $l$-th layer, the hidden representation of node $V_{i,j}$ in $G$ is updated as
$$
\begin{aligned}
m^{l+1}_{V_{i,j}} &= \mathrm{AGGREGATE}_l\big(\{h^{l}_{V'},\ \forall V' \in \mathcal{N}(V_{i,j})\}\big), \\
h^{l+1}_{V_{i,j}} &= \sigma\big(W^{l}\,[h^{l}_{V_{i,j}};\, m^{l+1}_{V_{i,j}}]\big),
\end{aligned}
\qquad (5.2)
$$
where $\mathcal{N}(V_{i,j})$ is the neighborhood of $V_{i,j}$, $[\cdot\,;\cdot]$ denotes concatenation, $\mathrm{AGGREGATE}_l(\cdot)$ is an aggregation function in [28], and $\sigma$ is a nonlinear activation function. We first obtain the new message $m^{l+1}_{V_{i,j}}$ by aggregating information from the neighbors of node $V_{i,j}$; then the updated hidden representation $h^{l+1}_{V_{i,j}}$ is obtained from the new message and the previous hidden representation of node $V_{i,j}$. The node features at the last layer are then globally pooled into a graph embedding, from which we can predict the compatibility.
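For illustration, one GraphSage-style layer following Eq. (5.2) with mean aggregation could be sketched as below (PyTorch assumed); in practice the GraphSage implementation of [28] or a graph library would be used.

```python
# Minimal sketch of one GraphSage-style layer implementing Eq. (5.2) with mean
# aggregation; a library implementation would be used in practice.
import torch
import torch.nn as nn

class SageLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)   # W^l applied to [h ; m]

    def forward(self, h, neighbors):
        """h: (num_nodes, in_dim); neighbors: list of neighbor-index lists."""
        messages = torch.stack([
            h[idx].mean(dim=0) if len(idx) > 0 else torch.zeros_like(h[0])
            for idx in neighbors                       # m^{l+1}: aggregate neighbor features
        ])
        return torch.relu(self.linear(torch.cat([h, messages], dim=1)))  # sigma(W^l [h ; m])
```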
5.3.3 Compatibility Prediction and Outfit Clustering
By using both compatible and incompatible outfits as training data, we can train our
network to predict the compatibility. There are different reasons for a set of colors to
be considered compatible, e.g. similar colors, or different colors with a synergy. To
further understand such differences in compatibility patterns, we propose to cluster the
outfits in the embedding space. Each cluster can thus be interpreted as outfits with
similar patterns of color compatibility. A straightforward approach of clustering is to
first train a predictor with the compatibility training data, then cluster the outfits in their
embedding space for pattern exploration. One drawback of this approach is that the
clustering algorithm is applied to the embeddings that are fixed after training. It is hard
to verify if the clusters obtained are meaningful for compatibility prediction. Thus the
interpretation from those clusters cannot help us understand the prediction.
Figure 5.3: Illustration of the proposed joint compatibility prediction and outfit clustering method. We alternately generate pseudo labels for every sample and update the network parameters using the pseudo labels.
Rather than learning fashion compatibility patterns via two separate steps, we pro-
pose a joint classification and clustering training scheme that alternately generates pseudo labels from the network and updates the network with the pseudo labels.
An illustration of the proposed joint classification and clustering scheme is shown in
Figure 5.3. The pseudo label generation procedure consists of four steps:
Collecting graph embeddings. We first collect the graph embeddings of all training outfit samples. Then, for each sample $i$ with the outfit input $x_i$ and the original label $y_i$, we generate a pseudo multi-class label $\tilde{y}_i \in \{0, 1, \ldots, C-1\}$.
Clustering. The current embeddings of the outfits are clustered into $C$ clusters. $C = 4$ is used for illustration in Figure 5.3. For cluster $c \in \{0, 1, \ldots, C-1\}$, we calculate the ratio $r_c$ between the number of compatible samples and incompatible samples within the cluster.
Assigning cluster labels. The obtained clusters are ranked by the ratios $r_c$. The top $n_{pos}$ clusters are assigned positive cluster labels, whereas the remaining $n_{neg} = C - n_{pos}$ clusters are assigned negative cluster labels. Figure 5.3 illustrates the case where $n_{pos} = 3$ and $n_{neg} = 1$.
Generating pseudo sample labels. For each sample $(x_i, y_i)$ in cluster $c$, if its label $y_i$ equals its cluster label, its pseudo label is assigned as $c$. If its label $y_i$ does not equal its cluster label $z_c$, we assign it to the closest cluster $c'$ that has a cluster label equal to $y_i$. Examples are illustrated in Figure 5.3.
In this way, we fuse the cluster assignments with classification labels to form the
pseudo multi-class labels, then use them to update the embedding network. The cluster-
ing method DeepCluster [7] iteratively clustered the features obtained from the network,
then used the cluster assignments as pseudo labels to update the network parameters. In
our method, we make use of the available compatibility labels. The pseudo labels are
obtained by fusing the ground-truth compatibility labels with the cluster assignments.
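A minimal sketch of this pseudo-label generation is given below, assuming NumPy and scikit-learn and that the graph embeddings and binary compatibility labels have already been collected; cosine-similarity K-means is approximated here by K-means on L2-normalized embeddings, and all names are illustrative.

```python
# Sketch of the pseudo multi-class label generation (steps illustrated in Figure 5.3).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def generate_pseudo_labels(embeddings, labels, n_clusters=5, n_pos=4):
    """embeddings: (N, d); labels: (N,) with 1 = compatible, 0 = incompatible."""
    labels = np.asarray(labels)
    z = normalize(np.asarray(embeddings, dtype=float))   # cosine-style K-means via L2 normalization
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(z)
    assign = km.labels_
    # rank clusters by their fraction of compatible samples
    ratios = [labels[assign == c].mean() if np.any(assign == c) else 0.0
              for c in range(n_clusters)]
    order = np.argsort(ratios)[::-1]
    cluster_label = {int(c): int(rank < n_pos) for rank, c in enumerate(order)}
    pseudo = np.empty(len(labels), dtype=int)
    for i, (c, y) in enumerate(zip(assign, labels)):
        if cluster_label[int(c)] == y:
            pseudo[i] = c                                 # keep the assigned cluster
        else:                                             # move to the closest cluster matching y
            dists = np.linalg.norm(km.cluster_centers_ - z[i], axis=1)
            valid = [k for k in range(n_clusters) if cluster_label[k] == y]
            pseudo[i] = min(valid, key=lambda k: dists[k])
    return pseudo
```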
5.3.4 Global and local supervisions
The original data collected in existing datasets (Polyvore Outfits [106] and Maryland
Polyvore [29]) only contain compatible outfits as positive samples. For the tasks of com-
patibility prediction and fill-in-the-blank, negative outfits were constructed by randomly
sampling items in the data. We also follow this random sampling strategy to construct
negative samples for our training procedure. Specifically, for each compatible outfit provided as a positive sample, we randomly replace several items with items from the same categories to form an incompatible outfit as the corresponding negative sample.
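A minimal sketch of this negative-sample construction is given below; it assumes each item carries a category id and that a pool of items per category is available, and the helper names are hypothetical.

```python
# Sketch of negative-outfit construction: replace a few items of a compatible
# outfit with random items from the same categories.
import random

def make_negative_outfit(outfit, items_by_category, num_replace=2):
    """outfit: list of (item_id, category); items_by_category: dict category -> item_id list."""
    negative = list(outfit)
    for idx in random.sample(range(len(outfit)), k=min(num_replace, len(outfit))):
        item_id, category = outfit[idx]
        candidates = [i for i in items_by_category.get(category, []) if i != item_id]
        if candidates:                                   # skip if no substitute exists
            negative[idx] = (random.choice(candidates), category)
    return negative
```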
Figure 5.4: Graph constructions for global and local supervision of a given pair of samples. For global supervision, all items in the outfits are used. For local supervision, common items in the positive-negative sample pair are not used. Eliminated nodes and edges are denoted in dotted lines.
We propose to use the two loss functions in Equations 5.3 and 5.4. Figure 5.4 illus-
trates the corresponding graph construction where each sample contains five items and
the two samples have two items in common.
The first loss is a global loss that is applied to entire outfits. As illustrated in Figure 5.4, the compatible outfit on the top represents a positive input $x_{i,pos}$ and the incompatible outfit at the bottom represents a negative input $x_{i,neg}$. We construct the graphs using every item provided in the sample pair, as illustrated in Figure 5.4 on the left. We use the categorical cross-entropy loss to update the network parameters with the pseudo labels:
$$
\begin{aligned}
l_1 &= \sum_i \big[\, l_{CCE}\big([g(x_{i,pos})]_{\tilde{y}_{i,pos}},\ \tilde{y}_{i,pos}\big) + l_{CCE}\big([g(x_{i,neg})]_{\tilde{y}_{i,neg}},\ \tilde{y}_{i,neg}\big) \big] \\
&= \sum_i \big[ -\log [g(x_{i,pos})]_{\tilde{y}_{i,pos}} - \log [g(x_{i,neg})]_{\tilde{y}_{i,neg}} \big],
\end{aligned}
\qquad (5.3)
$$
where $l_{CCE}(\cdot,\cdot)$ is the categorical cross-entropy loss, $[g(\cdot)]_c$ is the predicted score for class $c \in \{0, 1, \ldots, C-1\}$, and the subscripts $i,pos$ and $i,neg$ denote the positive and the negative outfits of the $i$-th pair of samples, respectively.
Additionally, we use a local loss that is applied to subsets of the outfits. It helps the
network learn more subtle differences between item relations. For each positive/negative
pair of outfits, we remove their common items from both outfits to obtain a pair of outfit
subsets. The intuition is that the subset of a compatible outfit should also be compatible,
and randomly selected items are most likely to be incompatible. The correspondingly
constructed graphs are shown on the right side of Figure 5.4. Then for this pair of outfit
subsets, we apply a binary cross-entropy loss with the compatibility labels. Let $p(\cdot)$ denote the sum of $[g(\cdot)]_c$ over all positive clusters $c$, i.e., the predicted probability that a sample is compatible. The local loss can be formulated as
$$
\begin{aligned}
l_2 &= \sum_i \big[\, l_{BCE}\big(g(\hat{x}_{i,pos}),\ y_{i,pos}\big) + l_{BCE}\big(g(\hat{x}_{i,neg}),\ y_{i,neg}\big) \big] \\
&= \sum_i \big[ -\log\big(p(\hat{x}_{i,pos})\big) - \log\big(1 - p(\hat{x}_{i,neg})\big) \big],
\end{aligned}
\qquad (5.4)
$$
where $l_{BCE}(\cdot,\cdot)$ denotes the binary cross-entropy loss and $\hat{x}$ denotes the subset of $x$ obtained by removing the common items. The final loss function takes the weighted sum of $l_1$ and $l_2$ as
$$l = l_1 + \lambda\, l_2. \qquad (5.5)$$
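A minimal sketch of the combined loss in Eqs. (5.3)-(5.5) is given below, assuming PyTorch, that the network outputs $C$ class logits per outfit graph, and that the indices of the positive clusters are known; the function and argument names are illustrative.

```python
# Sketch of the combined loss l = l_1 + lambda * l_2 (Eqs. 5.3-5.5).
import torch
import torch.nn.functional as F

def combined_loss(logits_pos, logits_neg, pseudo_pos, pseudo_neg,
                  logits_pos_sub, logits_neg_sub, positive_clusters, lam=0.5):
    # global loss l_1: categorical cross-entropy with the pseudo multi-class labels
    l1 = F.cross_entropy(logits_pos, pseudo_pos) + F.cross_entropy(logits_neg, pseudo_neg)

    # p(.): summed probability over the positive clusters, i.e. P(compatible)
    def p_compatible(logits):
        probs = F.softmax(logits, dim=1)
        return probs[:, positive_clusters].sum(dim=1).clamp(1e-7, 1 - 1e-7)

    # local loss l_2: binary cross-entropy on the outfit subsets (common items removed)
    l2 = (-torch.log(p_compatible(logits_pos_sub))
          - torch.log(1 - p_compatible(logits_neg_sub))).mean()
    return l1 + lam * l2
```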
5.4 Experimental Analysis
5.4.1 Datasets
Polyvore Outfits [106]. This dataset contains 53,306, 5,000, and 10,000 outfits for training, validation, and testing, respectively. It provides one image per fashion item, and some items also come with text descriptions. Each outfit contains 2 to 19 fashion items (5.3 on average), and each fashion item appears in 1.4 outfits on average. Two tasks
are provided for benchmarking. One task is the compatibility prediction (CP), in which
the compatibility scores predicted for given outfits are evaluated using area-under-curve
(AUC). The other task is the fill-in-the-blank (FITB), where one needs to choose one
fashion item out of four candidates that fits the given outfit subset best.
Maryland Polyvore [29]. This dataset is similar to the Polyvore Outfits dataset, but
its size is smaller. It has 17,316, 1,407, and 3,076 outfits for training, validation, and testing, respectively. It also provides the two tasks (CP and FITB) for benchmarking. The
test set provided by [29] contains negative samples that are not challenging enough. We
follow the previous studies [106, 101] and evaluate our model on a more challenging
test set proposed in [106], which is obtained through the same procedure as in Polyvore
Outfits.
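For reference, the two metrics could be computed as in the following sketch, assuming scikit-learn; score_outfit stands for a generic, hypothetical outfit-scoring function rather than a specific API of either dataset.

```python
# Sketch of the two benchmark metrics: compatibility AUC and FITB accuracy.
import numpy as np
from sklearn.metrics import roc_auc_score

def compat_auc(scores, labels):
    """scores: predicted compatibility scores; labels: 1 = compatible, 0 = not."""
    return roc_auc_score(labels, scores)

def fitb_accuracy(questions, score_outfit):
    """questions: list of (partial_outfit, candidates, answer_index);
    score_outfit: a (hypothetical) function scoring a complete outfit."""
    correct = 0
    for partial, candidates, answer in questions:
        scores = [score_outfit(partial + [c]) for c in candidates]
        correct += int(np.argmax(scores) == answer)
    return correct / len(questions)
```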
5.4.2 Implementation Details
As the preprocessing step, we first separate fashion items from image backgrounds using [5], then we convert the fashion item pixels from the RGB color space to the Lab color space. Afterwards, we extract a 3-color palette from each fashion item image.
We use one fully-connected layer for node feature embedding and a 3-layer GraphSage [28] with ReLU activation and global max-pooling for graph embedding. The hidden representations are 60-D and the final embedding is 20-D. We utilize K-means with cosine similarity for clustering and re-assign the pseudo labels to outfits every 25 epochs of training. We set $n_{pos} = 4$, $n_{neg} = 1$, and $\lambda = 0.5$.
For a fair comparison, we also follow previous work [106, 101] and utilize 64-D deep image features from ResNet-18 [31], as well as 6000-D text
features using HGLMM Fisher vectors [51] from word2vec [76] with PCA dimension
reduction for various experiment settings.
5.4.3 Comparison with Previous Work
We compare with the results of the following methods to demonstrate the effectiveness
of our method:
Bi-LSTM [29]. It utilizes the deep image features extracted from Inception-v3 [99].
The text descriptions are used to regularize the visual feature space during training.
Siamese Network [107]. The approach is re-implemented in [106] with ResNet-18
[31] features.
Type-Aware Embedding Network [106]. It uses ResNet-18 [31] as its backbone
and embeds text using HGLMM Fisher vectors [51] with PCA. Furthermore, it also
requires the category information of each fashion item during both training and testing.
SCE-Net [101]. It follows the same procedure of feature extraction and feature
usage, except for the category information, which is not required during training or
testing.
The comparisons are conducted in two groups. In the first group, the images are used
as inputs for both training and testing. The image representations used are either color
palettes extracted using our method, or deep image features extracted from convolutional
neural networks. In the second group, besides images, the text descriptions are used to
regularize the embedding during training, using the method proposed in [29].
The results demonstrate the importance of color compatibility, the effectiveness of the proposed method, and its applicability to various types of fashion representations. The
results of the first group are listed in Table 5.1. With 9-dimensional color palettes as
image representation alone, our results on the two datasets are already comparable to
previous methods which use deep features as image representation. On the large-scale
Polyvore Outfits dataset, specifically, our method with color palettes even outperforms
Siamese Net and Type-Aware Embedding Network. Using deep features further boosts
the results of the proposed method. As listed in Table 5.2, a similar conclusion can be drawn from the results of the second group. It verifies that the proposed method is applicable to fashion compatibility tasks with various fashion representations.
Methods                              Image representation   Polyvore Outfits          Maryland Polyvore
                                                            Compat AUC   FITB Acc     Compat AUC   FITB Acc
Siamese Net [107]                    Deep features          0.81         52.9         0.85         54.4
Type-Aware Embedding Network [106]   Deep features          0.83         54.0         0.87         57.9
Ours                                 Color palettes         0.84         58.0         0.83         49.1
Ours                                 Deep features          0.90         59.1         0.91         59.4

Table 5.1: Comparison with previous methods using color palettes or deep features as image representation. For Siamese Net, we adopt the performances reported in [106]. All the methods extract deep features from ResNet-18.
Methods                              Image representation   Polyvore Outfits          Maryland Polyvore
                                                            Compat AUC   FITB Acc     Compat AUC   FITB Acc
Bi-LSTM* [29]                        Deep features          0.65         39.7         0.94         64.9
Type-Aware Embedding Network [106]   Deep features          0.86         55.3         0.90         61.0
SCE-Net [101]                        Deep features          0.91         61.6         0.90         60.8
Ours                                 Deep features          0.90         65.8         0.91         66.6

Table 5.2: Comparison with previous methods. Text information is additionally utilized during training as regularization. For Bi-LSTM and Siamese Net, we adopt the performances reported in [106]. All the methods extract deep features from ResNet-18, except for Bi-LSTM, which uses Inception-v3.
By comparing results across the two datasets, we find the proposed method is more robust. When switching from the small-scale Maryland Polyvore dataset to the large-scale Polyvore Outfits dataset, the performances of most previous methods drop dramatically. On the contrary, the proposed method achieves comparable results on both datasets.
5.4.4 Ablation Studies
We compare the following methods as ablation studies to evaluate the contribution of
each proposed module:
Methods          #clusters   Compat AUC   FITB Acc
NG               -           0.76         42.0
LG               -           0.84         52.7
LG+$l_1$         5           0.84         54.7
LG+$l_1$+$l_2$   3           0.84         54.2
LG+$l_1$+$l_2$   5           0.84         58.8
LG+$l_1$+$l_2$   9           0.84         55.1

Table 5.3: Ablation studies on how each proposed module contributes to the final performance. The experiments are conducted on the validation set of Polyvore Outfits with color palettes as fashion representations.
NG. The graph is constructed such that each node represents a fashion item in the
outfit. The binary cross-entropy loss is used.
LG. As proposed in Section 5.3.2, the graph is constructed such that each node
represents a pairwise relation between fashion items. The binary cross-entropy loss is
used.
LG+$l_1$. On the proposed graph, we apply joint training as described in Section 5.3.3, with the $l_1$ loss in (5.3) only.
LG+$l_1$+$l_2$. Our proposed method.
The ablation results are evaluated on the validation set of the Polyvore Outfits dataset
and listed in Table 5.3. All methods compared use color palettes.
Each proposed module contributes to the final performance. By comparing the
results of NG and LG, the proposed graph construction significantly improves the per-
formance on both tasks. With joint training on our constructed graphs, the accuracy of the FITB task is boosted. This shows that the joint training helps the network learn the subtle differences in item relations. From experiments with different numbers of clusters on the validation set, we find that the optimal performance can be achieved with only a few
clusters.
(a) Cluster 1: The visualization suggests that the items of the outfits in this cluster have similar colors.
(b) Cluster 2: The visualization suggests that the outfits in this cluster have a combination of two major colors that are distinct in hue but similar in chroma and lightness, and one neutral color.
(c) Cluster 3: The visualization suggests that the items of the outfits in this cluster have colors with high variance in hue, chroma, and lightness.
(d) Cluster 4: The visualization suggests that the outfits in this cluster have colors with high contrast in chroma. Specifically, there is one major color with high chroma as a spotlight and one neutral color (black, white) for balance.
Figure 5.5: Compatible outfit clusters predicted using color palettes as fashion representation. Due to limited space, we only display 50 outfits per cluster and at most 10 items per outfit. The 50 outfits are shown on the left, where items in the same column belong to the same outfit. Two outfit samples in the bounding boxes are enlarged and shown on the right.
5.4.5 Outfit Cluster Visualization
To gain more insights into our color compatibility model, we visualize the clusters obtained from learning the fashion outfits. For each outfit sample in the validation set of the Polyvore Outfits dataset, we obtain its predicted cluster using our color compatibility model. As an illustration, 50 outfit samples in each positive cluster are displayed in Figure 5.5. For better demonstration, two outfits in boxes from each cluster are enlarged on the right.

Figure 5.6: A compatible outfit cluster predicted using deep image features that reveals a color pattern. The items in the same column belong to the same outfit. The visualization suggests that the items of the outfits in this cluster have similar colors.

Figure 5.7: TSNE visualization of the outfit embeddings obtained from a model trained with one negative cluster. The black points are the samples whose compatibility is wrongly predicted, and the points in other colors are the samples whose compatibility is correctly predicted. Furthermore, each color indicates a different cluster predicted by the trained model.
We can see that each cluster has its own color pattern. The visualization in Figure
5.5(a) suggests that the items in each outfit share similar colors. The cluster in Figure
5.5(b) demonstrates the combination of two major colors that are distinct in hues but
similar in chromas and lightnesses, and one neutral color with low chroma such as black
or white. Interestingly, most outfits in this cluster also have transition items that con-
tain both colors. On the other hand, Figure 5.5(c) displays a cluster where item colors
show high variance in lightnesses, chromas and hues. The color pattern in Figure 5.5(d)
demonstrates the contrast in chroma. There is one major color with high chroma in each
outfit as spotlight, and a neutral color that has low chroma such as black or white.
It is not surprising that we can also find some color patterns in the visualization of
the model trained on deep image features. One example of the clusters is shown in
Figure 5.6, which displays the color pattern of similar colors. It again verifies our claim
that color is an essential impact factor for outfit compatibility.
In Figure 5.7, we visualize outfit embeddings on the validation set of Polyvore Out-
fits with one negative cluster. Different colors indicate the predictions made by our
model. The samples with wrongly predicted compatibility are denoted in black color.
The other colors indicate predicted clusters of the samples whose compatibilities are
correctly predicted. It shows that the embeddings of negative samples are concentrated
after the joint training. Thus there is no need to further divide the negative samples.
The experimental results also verified that assigning multiple negative clusters does not
improve the performance.
(a) Cluster 1: similar colors. Cluster 3: high variance of lightnesses, chromas and hues.
(b) Cluster 1: similar colors. Cluster 4: high contrast of chroma.
(c) Cluster 2: two major colors distinct in hues but similar in chromas and lightnesses, and a neutral color. Cluster 3: high variance of lightnesses, chromas and hues.
Figure 5.8: Left: fashion items of the given incomplete outfits. Right: fashion item recommendation based on different patterns of compatible colors.
5.4.6 Fashion Recommendation Based on Color Compatibility
Given a set of fashion items as an incomplete outfit, we explore the fashion item recom-
mendation problem using our color compatibility model. We make use of the incomplete
outfits from the FITB task in the Polyvore Outfits dataset, then we replace the original
four candidates with 1000 randomly sampled candidates. Since our model is jointly
trained for compatibility prediction and outfit clustering, we can not only provide item
recommendation, but also recommend items with different color patterns.
In Figure 5.8, we display recommendations for three incomplete outfits using the
same color compatibility model used for outfit cluster visualization in Section 5.4.5.
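A minimal sketch of this candidate-ranking procedure is given below; score_outfit and predict_cluster are hypothetical stand-ins for the trained compatibility predictor and cluster predictor.

```python
# Sketch of color-pattern-aware item recommendation: rank candidates by the
# predicted compatibility of the completed outfit, optionally filtering by the
# cluster (color pattern) predicted by the jointly trained model.
def recommend(partial_outfit, candidates, score_outfit, predict_cluster,
              target_cluster=None, top_k=5):
    ranked = []
    for item in candidates:
        outfit = partial_outfit + [item]
        if target_cluster is not None and predict_cluster(outfit) != target_cluster:
            continue                          # keep only the requested color pattern
        ranked.append((score_outfit(outfit), item))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ranked[:top_k]]
```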
5.5 Conclusion
In this chapter, we presented a framework for learning color compatibility in fashion outfits, which plays an essential role in outfit compatibility. We first extract simple color palettes as fashion representations. To learn the compatibility of an outfit globally, we model each outfit as a graph such that each node represents the pairwise relation between two fashion items. Then a graph convolutional network is applied to embed the graph. To interpret the patterns of color compatibility, we proposed a joint learning scheme for compatibility prediction and outfit clustering using generated pseudo labels. Finally, we conducted thorough experiments to demonstrate the effectiveness of our proposed learning framework. The results show that the proposed method is able to learn color compatibility, as well as fashion compatibility with other representations.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
In this thesis, we focused on local-aware deep learning techniques from three
different aspects: multi-modal attention mechanism, facial parts detection, and local
fashion details extraction.
Specifically, we designed a generative visual dialogue system using a multi-modal attention mechanism and a weighted likelihood estimation method. Our proposed multi-modal recurrently guided attention attends to image, question, and dialogue history inputs simultaneously and combines this multi-modal information to provide clues for
responses. The weighted likelihood estimation method utilizes given negative responses
during the training process and therefore improves the quality of generated responses
during the testing process. The proposed method achieved state-of-the-art performance on the standard VisDial benchmark dataset.
For face detection, we aimed at a real-time face detector on mobile devices, which
faces three major challenges: fast speed, small model size and high accuracy. We pro-
posed a new proposal network which can be combined with most modern face detectors
to accelerate the detection speed by detecting multi-scale faces in one forward pass.
Particularly, we detected entire faces and facial parts at the same time, and then used the detected facial part regions to infer face regions. The proposed method was validated on
two popular benchmark datasets.
For fashion feature extraction, we proposed a two-stage method which leverages
weakly annotated web images. We collected a fashion dataset from online resources and constructed automatic annotations by utilizing their data structures.
Our feature extraction method can zoom into local regions which cannot be easily
detected. The proposed method showed good results in representative region detection,
attribute classification and similar cloth retrieval experiments.
For fashion outfit compatibility, we proposed an interpretable compatibility predic-
tor. We first modelled the fashion outfits as graphs and proposed a novel graph con-
struction. Based on the outfit graphs, we developed a joint compatibility prediction and
outfit clustering method to improve the prediction accuracy as well as the interpretabil-
ity of the prediction. The experimental results on two benchmark datasets validate the
effectiveness of our proposed method.
6.2 Future Research
We have seen the importance of utilizing local information. It is natural that some local regions are more important or relevant than others for certain needs. For example, if we are looking for a cat in an indoor image, then the walls will not be of interest. To further develop the usage of local information, the attention mechanism is a research direction that is worth exploring.
Currently, attention mechanisms are widely applied to many applications such as image captioning [119, 69], visual question answering [118, 128], and visual dialogue [68, 116]. Conventional attention methods attend to certain regions of feature tensors via a weighted sum [68, 116, 119, 69, 118, 128]. Specifically, these methods assign weights to each region of the feature tensors and then sum these weighted feature regions up to
obtain a global attention feature. An illustration of attention on a three-dimensional feature tensor (e.g., image features extracted from convolutional layers) is shown in Figure 6.1.

Figure 6.1: Illustration of conventional attention methods. The input is a three-dimensional tensor which has $w \times h$ features of size $c \times 1$, representing the features of $w \times h$ regions. The output is a weighted-sum feature of size $c \times 1$.
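A minimal sketch of such a weighted-sum attention over a $(c, w, h)$ feature tensor is given below (PyTorch assumed); here the weights come from dot-product similarity to a reference feature, one common instantiation of the comparison discussed further in Section 6.2.1.

```python
# Minimal sketch of conventional attention over a (c, w, h) feature tensor:
# each region is scored against a reference feature and the output is the
# weighted sum of the region features (a single c-dimensional vector).
import torch

def weighted_sum_attention(feature_map, reference):
    """feature_map: (c, w, h) tensor; reference: (c,) tensor."""
    c, w, h = feature_map.shape
    regions = feature_map.reshape(c, w * h).t()        # (w*h, c) region features
    scores = regions @ reference                        # similarity to the reference feature
    weights = torch.softmax(scores, dim=0)              # attention weights over regions
    return (weights.unsqueeze(1) * regions).sum(dim=0)  # weighted-sum feature, size (c,)
```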
We bring up two research problems as follows:
How to incorporate context information into the conventional attention mecha-
nisms?
How to extend the attention mechanisms to graphs?
6.2.1 Context-aware Attention Mechanism
In conventional attention methods, the weights, which determine how much attention is
paid to each region, are obtained by comparing a reference feature with the features to be
attended. As illustrated in Figure 6.2, the feature of each region is separately compared
to the reference feature. Higher similarity to the reference feature yields higher weight to
the corresponding region. In the “person” attention example in Figure 6.3, the reference
feature is the feature of person. The regions containing people have higher similarity to
the reference, and therefore obtain higher weights.
Figure 6.2: Illustration of assigning weights in conventional attention methods.
Such a design is effective for detecting objects of interest in an image. However, it is not very helpful for understanding the image.
Let's take image captioning as an example. Given attention weights focusing on the object "person", visualized as a heatmap in Figure 6.3, three people (two in front and one in back) are mainly highlighted in the heatmap. It is natural that these three people are not equally important when we want to describe the image in one sentence. Similarly, the food in the hands of the woman on the left should not be as important as the food on the plate on the table. As a reference, one caption from the COCO dataset [63]
generated by human annotators is “2 women are posing with their food at an event”.
(a) Image (b) Attention heatmap
Figure 6.3: Attention visualization on the object "person". Darker blue regions denote regions assigned higher weights.
Separately assigning weight to each region without context results in equal impor-
tance to similar objects. For better understanding of images, we should not treat each
region separately. Instead, we should incorporate context into the process of assigning
weights.
6.2.2 Attention Mechanism on Graphs
For a given graph, it is obvious that some nodes could be more important than others
for certain tasks. In these cases, extracting the local information of these important
nodes would be helpful. For example, for fashion outfit compatibility prediction, if
some items in a given outfit are not compatible with each other, the entire outfit would
be incompatible.
Figure 6.4: Illustration of manually masking out shared nodes for a pair of outfits in
Sec. 5.
In Sec. 5, we have constructed positive and negative outfit pairs and designed a local
loss function to help the network learn the subtle difference between compatible and
incompatible outfits. For the proposed local loss function, we manually masked out the
nodes shared in the graph pairs to let the network focus on the subtle difference (as
illustrated in Figure 6.4). With the help of the attention mechanism on graphs, we might
be able to automatically mask out less important nodes and conserve more information
from important nodes.
In [108], an architecture of graph attention layer is proposed, where the attention
mechanism is a single-layer feedforward neural network. Unlike the attention mecha-
nisms in convolutional neural networks where the output feature is a weighted sum of
input features, the number of nodes is not decreased by the proposed graph attention
layer. The output feature of each node is an aggregation of the features of neighboring
nodes, and the graph architecture is preserved.
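As a minimal single-head sketch in the spirit of [108] (not the exact architecture), the attention coefficients below come from a small feedforward scoring network over concatenated node features and are softmax-normalized over each neighborhood; PyTorch assumed.

```python
# Minimal single-head sketch in the spirit of the graph attention layer of [108]:
# a small feedforward network scores each (node, neighbor) pair, the scores are
# softmax-normalized over the neighborhood, and neighbor features are aggregated.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.att = nn.Linear(2 * out_dim, 1, bias=False)   # single-layer scoring network

    def forward(self, h, neighbors):
        """h: (num_nodes, in_dim); neighbors: list of neighbor-index lists (including self)."""
        z = self.proj(h)
        outputs = []
        for i, idx in enumerate(neighbors):
            pair = torch.cat([z[i].expand(len(idx), -1), z[idx]], dim=1)
            alpha = torch.softmax(F.leaky_relu(self.att(pair)), dim=0)  # attention coefficients
            outputs.append((alpha * z[idx]).sum(dim=0))                 # weighted aggregation
        return torch.stack(outputs)
```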
Rather than simply aggregating information from neighboring nodes using the proposed attention mechanism, we would like to evaluate the importance of each node to the entire graph. Such results would be useful for the cases described above where we
would like to mask out some less important nodes.
Bibliography
[1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embed-
dings for fine-grained image classification. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang.
Bottom-up and top-down attention for image captioning and visual question
answering. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6077–6086, 2018.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and
D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE interna-
tional conference on computer vision, pages 2425–2433, 2015.
[4] Y . Bai, W. Ma, Y . Li, L. Cao, W. Guo, and L. Yang. Multi-scale fully convolu-
tional network for fast face detection. In BMVC, 2016.
[5] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
[6] J. Bruna, W. Zaremba, A. Szlam, and Y . LeCun. Spectral networks and locally
connected networks on graphs. In Proceedings of the 3rd International Confer-
ence on Learning Representations, 2014.
[7] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsuper-
vised learning of visual features. In Proceedings of the European Conference on
Computer Vision (ECCV), pages 132–149, 2018.
[8] L. Chen and Y . He. Dress fashionably: Learn fashion collocation with deep
mixed-category metric learning. In Thirty-Second AAAI Conference on Artificial
Intelligence, 2018.
[9] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and S. Yan. Deep domain
adaptation for describing people based on fine-grained clothing attributes. In
Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on,
pages 5315–5324. IEEE, 2015.
[10] W. Chen, P. Huang, J. Xu, X. Guo, C. Guo, F. Sun, C. Li, A. Pfadler, H. Zhao,
and B. Zhao. Pog: Personalized outfit generation for fashion recommendation at
alibaba ifashion. In Proceedings of the 25th ACM SIGKDD International Con-
ference on Knowledge Discovery & Data Mining, 2019.
[11] D. Cohen-Or, O. Sorkine, R. Gal, T. Leyvand, and Y .-Q. Xu. Color harmo-
nization. In ACM Transactions on Graphics (TOG), volume 25, pages 624–630.
ACM, 2006.
[12] G. Cucurull, P. Taslakian, and D. Vazquez. Context-aware visual compatibility
prediction. In Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 12617–12626, 2019.
[13] Z. Cui, Z. Li, S. Wu, X.-Y . Zhang, and L. Wang. Dressing as a whole: Outfit
compatibility learning based on node-wise graph neural networks. In The World
Wide Web Conference, pages 307–317. ACM, 2019.
[14] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and
D. Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, volume 2, 2017.
[15] H. De Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle, and A. C. Courville.
Guesswhat?! visual object discovery through multi-modal dialogue. In CVPR,
volume 1, page 3, 2017.
[16] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks
on graphs with fast localized spectral filtering. In Advances in neural information
processing systems, pages 3844–3852, 2016.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-
scale hierarchical image database. In Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[18] J. Deng and X. Xie. Nested shallow cnn-cascade for face detection in the wild. In
IEEE International Conference on Automatic Face & Gesture Recognition(FG),
pages 165–172. IEEE, 2017.
[19] Q. Dong, S. Gong, and X. Zhu. Multi-task curriculum transfer deep learning
of clothing attributes. In Applications of Computer Vision (WACV), 2017 IEEE
Winter Conference on, pages 520–529. IEEE, 2017.
[20] Ž. Emeršič, V. Štruc, and P. Peer. Ear recognition: More than a survey. Neurocomputing, 2017.
[21] S. S. Farfade, M. J. Saberian, and L.-J. Li. Multi-view face detection using deep
convolutional neural networks. In 5th ACM on International Conference on Mul-
timedia Retrieval, pages 643–650. ACM, 2015.
[22] Z. Feng, Z. Yu, Y . Yang, Y . Jing, J. Jiang, and M. Song. Interpretable partitioned
embedding for customized fashion outfit composition. In Proceedings of the 2018
ACM on International Conference on Multimedia Retrieva (ICMR), pages 143–
151, 2018.
[23] D. Frejlichowski and N. Tyszkiewicz. The west pomeranian university of tech-
nology ear database–a tool for testing biometric algorithms. Image analysis and
recognition, pages 227–234, 2010.
[24] Z. Gan, Y . Cheng, A. E. Kholy, L. Li, J. Liu, and J. Gao. Multi-step reasoning
via recurrent dual attention for visual dialog. arXiv preprint arXiv:1902.00579,
2019.
[25] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference
on Computer Vision (ICCV), pages 1440–1448, 2015.
[26] L. Gong and Q. Cheng. Exploiting edge features in graph neural networks. arXiv
preprint arXiv:1809.02709, 2018.
[27] E. Gonzalez, L. Alvarez, and L. Mazorra. AMI ear database. http://www.ctim.es/research_works/ami_ear_database/.
[28] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large
graphs. In Advances in Neural Information Processing Systems, pages 1024–
1034, 2017.
[29] X. Han, Z. Wu, Y .-G. Jiang, and L. S. Davis. Learning fashion compatibility with
bidirectional lstms. In Proceedings of the 25th ACM international conference on
Multimedia, pages 1078–1086. ACM, 2017.
[30] Z. Hao, Y . Liu, H. Qin, J. Yan, X. Li, and X. Hu. Scale-aware face detection. In
IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[31] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.
[32] M. Henaff, J. Bruna, and Y . LeCun. Deep convolutional networks on graph-
structured data. arXiv preprint arXiv:1506.05163, 2015.
[33] S. C. Hidayati, K.-L. Hua, W.-H. Cheng, and S.-W. Sun. What are the fashion
trends in new york? In Proceedings of the 22nd ACM international conference
on Multimedia, pages 197–200. ACM, 2014.
[34] E. Hoffer and N. Ailon. Deep metric learning using triplet network. In Interna-
tional Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer,
2015.
[35] H. Hu, W.-L. Chao, and F. Sha. Learning answer embeddings for visual question
answering. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5428–5436, 2018.
[36] P. Hu and D. Ramanan. Finding tiny faces. In IEEE Conference on Computer
Vision and Pattern Recognition, 2017.
[37] Y . Hu, X. Yi, and L. S. Davis. Collaborative fashion recommendation: A func-
tional tensor factorization approach. In Proceedings of the 23rd ACM interna-
tional conference on Multimedia, pages 129–138. ACM, 2015.
[38] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected
convolutional networks. In CVPR, volume 1, page 3, 2017.
[39] J. Huang, R. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a
dual attribute-aware ranking network. In Computer Vision (ICCV), 2015 IEEE
International Conference on, pages 1062–1070. IEEE, 2015.
[40] T. Iwata, S. Watanabe, and H. Sawada. Fashion coordinates recommender system
using photographs from fashion magazines. In Twenty-Second International Joint
Conference on Artificial Intelligence, 2011.
[41] V . Jain and E. Learned-Miller. Fddb: A benchmark for face detection in uncon-
strained settings. Technical report, Technical Report UM-CS-2010-009, Univer-
sity of Massachusetts, Amherst, 2010.
[42] F. Jiang, M. Fischer, H. K. Ekenel, and B. E. Shi. Combining texture and stereo
disparity cues for real-time face detection. Signal Processing: Image Communi-
cation, 28(9):1100–1113, 2013.
[43] Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse
coding via label consistent k-svd. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 1697–1704, 2011.
[44] J. Jin, B. Xu, X. Liu, Y . Wang, L. Cao, L. Han, B. Zhou, and M. Li. A face detec-
tion and location method based on feature binding. Signal Processing: Image
Communication, 36:179–189, 2015.
[45] W.-C. Kang, E. Kim, J. Leskovec, C. Rosenberg, and J. McAuley. Complete the
look: Scene-based complementary product recommendation. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 10532–
10541, 2019.
[46] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and
H. Winnemoeller. Recognizing image style. In Bmvc, 2014.
[47] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg. Hipster wars: Dis-
covering elements of fashion styles. In European conference on computer vision,
pages 472–488. Springer, 2014.
[48] J. Kim, T. Kim, S. Kim, and C. D. Yoo. Edge-labeling graph neural network for
few-shot learning. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 11–20, 2019.
[49] M. Kim, S. Kumar, V . Pavlovic, and H. Rowley. Face tracking and recognition
with visual constraints in real-world videos. In IEEE Conference on Computer
Vision and Pattern Recognition, 2008.
[50] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolu-
tional networks. arXiv preprint arXiv:1609.02907, 2016.
[51] B. Klein, G. Lev, G. Sadeh, and L. Wolf. Associating neural word embed-
dings with deep image representations using fisher vectors. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 4437–
4446, 2015.
[52] S. Kottur, J. M. Moura, D. Parikh, D. Batra, and M. Rohrbach. Visual coreference
resolution in visual dialog using neural module networks. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 153–169, 2018.
[53] R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalan-
tidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and
vision using crowdsourced dense image annotations. International Journal of
Computer Vision, 123(1):32–73, 2017.
[54] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.
[55] A. Kumar and C. Wu. Automated human identification using ear imaging. Pattern
Recognition, 45(3):956–968, 2012.
[56] D. Li, X. Wang, and D. Kong. Deeprebirth: Accelerating deep neural network
execution on mobile devices. arXiv preprint arXiv:1708.04728, 2017.
[57] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network
cascade for face detection. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 5325–5334, 2015.
[58] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky. Adversarial learning
for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
[59] Y . Li, L. Cao, J. Zhu, and J. Luo. Mining fashion outfit composition using an end-
to-end deep learning approach on set data. IEEE Transactions on Multimedia,
19(8):1946–1955, 2017.
[60] X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, and S. Yan. Deep
human parsing with active template regression. IEEE transactions on pattern
analysis and machine intelligence, 37(12):2402–2414, 2015.
[61] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan. Human
parsing with contextualized convolutional neural network. In Proceedings of the
IEEE International Conference on Computer Vision, pages 1386–1394, 2015.
[62] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. International conference on computer vision, 2017.
[63] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[64] Y . Lin, P. Ren, Z. Chen, Z. Ren, J. Ma, and M. De Rijke. Explainable out-
fit recommendation with joint outfit matching and comment generation. IEEE
Transactions on Knowledge and Data Engineering, 2019.
[65] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust
clothes recognition and retrieval with rich annotations. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 1096–
1104, 2016.
[66] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild.
In IEEE International Conference on Computer Vision, December 2015.
[67] Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang. Fashion landmark detection in
the wild. In European Conference on Computer Vision, pages 229–245. Springer,
2016.
[68] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra. Best of both worlds: Transfer-
ring knowledge from discriminative learning to a generative visual dialog model.
In Advances in Neural Information Processing Systems, pages 314–324, 2017.
[69] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive atten-
tion via a visual sentinel for image captioning. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR), volume 6, page 2,
2017.
[70] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention
for visual question answering. In Advances In Neural Information Processing
Systems, pages 289–297, 2016.
[71] Z. Lu, Y . Hu, Y . Jiang, Y . Chen, and B. Zeng. Learning binary code for per-
sonalized fashion recommendation. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 10562–10570, 2019.
[72] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based
approach to answering questions about images. In Proceedings of the IEEE inter-
national conference on computer vision, pages 1–9, 2015.
[73] D. Massiceti, N. Siddharth, P. K. Dokania, and P. H. Torr. Flipdial: A generative
model for two-way visual dialogue. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018.
[74] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without
bells and whistles. In European Conference on Computer Vision, pages 720–735.
Springer, 2014.
[75] Y . Matsuda. Color design. Asakura Shoten, 2(4):10, 1995.
[76] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in neural
information processing systems, pages 3111–3119, 2013.
[77] G. A. Miller. Wordnet: a lexical database for english. Communications of the
ACM, 38(11):39–41, 1995.
[78] F. Monti, O. Shchur, A. Bojchevski, O. Litany, S. Günnemann, and M. M. Bronstein. Dual-primal graph convolutional networks. arXiv preprint arXiv:1806.00770, 2018.
[79] T. Nakamura and R. Goto. Outfit generation and style extraction via bidirectional
lstm and autoencoder. arXiv preprint arXiv:1807.03133, 2018.
[80] K. Ning, M. Liu, and M. Dong. A new robust elm method based on a bayesian
framework with heavy-tailed distribution and weighted likelihood function. Neu-
rocomputing, 149:891–903, 2015.
[81] P. O’Donovan, A. Agarwala, and A. Hertzmann. Color compatibility from large
datasets. ACM Transactions on Graphics (TOG), 30(4):63, 2011.
[82] E. Ohn-Bar and M. M. Trivedi. To boost or not to boost? on the limits of boosted
trees for object detection. In Pattern Recognition (ICPR), 2016 23rd International
Conference on, pages 3350–3355. IEEE, 2016.
[83] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC,
volume 1, page 6, 2015.
[84] L. F. Polania and S. Gupte. Learning fashion compatibility across apparel cat-
egories for outfit recommendation. In 2019 IEEE International Conference on
Image Processing (ICIP), 2019.
[85] H. Qin, J. Yan, X. Li, and X. Hu. Joint training of cascaded cnn for face detection.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 3456–
3465, 2016.
[86] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object
detection with region proposal networks. In Advances in neural information pro-
cessing systems, pages 91–99, 2015.
[87] H. A. Rowley. Neural network-based face detection. Technical report, Carnegie-
Mellon Univ Pittsburgh PA Dept of Computer Science, 1999.
[88] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for
face recognition and clustering. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 815–823, 2015.
[89] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors
with online hard example mining. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 761–769, 2016.
[90] E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun. Neuroaesthetics in
fashion: Modeling the perception of fashionability. In CVPR, volume 2, page 6,
2015.
[91] E. Simo-Serra and H. Ishikawa. Fashion style in 128 floats: joint ranking and
classification using weak data for feature extraction. In Computer Vision and
Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 298–307. IEEE,
2016.
[92] M. Simonovsky and N. Komodakis. Dynamic edge-conditioned filters in convo-
lutional neural networks on graphs. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 3693–3702, 2017.
[93] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. In International Conference on Learning Representations,
2015.
[94] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang. Exemplar-based face
parsing. In IEEE Conference on Computer Vision and Pattern Recognition, pages
3484–3491, 2013.
[95] X. Song, F. Feng, J. Liu, Z. Li, L. Nie, and J. Ma. Neurostylist: Neural compati-
bility modeling for clothing matching. In Proceedings of the 25th ACM interna-
tional conference on Multimedia, pages 753–761. ACM, 2017.
[96] Y . Song, Z. Zhang, L. Liu, A. Rahimpour, and H. Qi. Dictionary reduction:
Automatic compact dictionary learning for classification. In Asian Conference
on Computer Vision, pages 305–320. Springer, 2016.
[97] M. Suk and B. Prabhakaran. Real-time mobile facial expression recognition
system-a case study. In IEEE Conference on Computer Vision and Pattern Recog-
nition Workshops, pages 132–137, 2014.
[98] I. Sutskever, O. Vinyals, and Q. V . Le. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, pages 3104–
3112, 2014.
[99] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the
inception architecture for computer vision. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pages 2818–2826, 2016.
[100] J. Tan, J. I. Echevarria, and Y. I. Gingold. Palette-based image decomposition,
harmonization, and color transfer. CoRR, abs/1804.01225, 2018.
[101] R. Tan, M. I. Vasileva, K. Saenko, and B. A. Plummer. Learning similarity con-
ditions without explicit supervision. In Proceedings of the IEEE International
Conference on Computer Vision, 2019.
[102] P. Tangseng and T. Okatani. Toward explainable fashion recommendation. arXiv
preprint arXiv:1901.04870, 2019.
[103] P. Tangseng, K. Yamaguchi, and T. Okatani. Recommending outfits from personal
closet. In Proceedings of the IEEE International Conference on Computer Vision,
pages 2275–2279, 2017.
[104] M. Tokumaru, N. Muranaka, and S. Imanishi. Color design support system con-
sidering color harmony. In 2002 IEEE World Congress on Computational Intel-
ligence. 2002 IEEE International Conference on Fuzzy Systems. FUZZ-IEEE’02.
Proceedings (Cat. No. 02CH37291), volume 1, pages 378–383. IEEE, 2002.
[105] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective
search for object recognition. International Journal of Computer Vision (IJCV),
104(2):154–171, 2013.
[106] M. I. Vasileva, B. A. Plummer, K. Dusad, S. Rajpal, R. Kumar, and D. Forsyth.
Learning type-aware embeddings for fashion compatibility. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 390–405, 2018.
[107] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie. Learning visual
clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the
IEEE International Conference on Computer Vision, pages 4642–4650, 2015.
[108] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph
attention networks. arXiv preprint arXiv:1710.10903, 2017.
[109] P. Viola and M. J. Jones. Robust real-time face detection. International journal
of computer vision, 57(2):137–154, 2004.
[110] D. Wang, J. Yang, J. Deng, and Q. Liu. Facehunter: A multi-task convolutional
neural network based face detector. Signal Processing: Image Communication,
47:476–481, 2016.
[111] L. Wang, X. Yu, and D. N. Metaxas. A coupled encoder-decoder network for
joint face detection and landmark localization. In IEEE International Conference
on Automatic Face & Gesture Recognition(FG), pages 251–257, 2017.
[112] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon.
Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics
(TOG), 38(5):146, 2019.
[113] T. A. Warm. Weighted likelihood estimation of ability in item response theory.
Psychometrika, 54(3):427–450, 1989.
[114] T.-H. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P.-H. Su,
S. Ultes, and S. Young. A network-based end-to-end trainable task-oriented dia-
logue system. arXiv preprint arXiv:1604.04562, 2016.
[115] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach
for deep face recognition. In European Conference on Computer Vision, pages
499–515. Springer, 2016.
[116] Q. Wu, P. Wang, C. Shen, I. Reid, and A. van den Hengel. Are you talking to me?
reasoned visual dialog generation through adversarial learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[117] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of
two-level attention models in deep convolutional neural network for fine-grained
image classification. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 842–850, 2015.
[118] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial
attention for visual question answering. In European Conference on Computer
Vision, pages 451–466. Springer, 2016.
[119] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and
Y. Bengio. Show, attend and tell: Neural image caption generation with visual
attention. In International conference on machine learning, pages 2048–2057,
2015.
[120] K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg. Paper doll parsing: Retrieving
similar styles to parse clothing items. In Proceedings of the IEEE international
conference on computer vision, pages 3519–3526, 2013.
[121] K. Yamaguchi, M. H. Kiapour, and T. L. Berg. Paper doll parsing: Retrieving
similar styles to parse clothing items. In Computer Vision (ICCV), 2013 IEEE
International Conference on, pages 3519–3526. IEEE, 2013.
[122] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg. Parsing clothing in
fashion photographs. In Computer Vision and Pattern Recognition (CVPR), 2012
IEEE Conference on, pages 3570–3577. IEEE, 2012.
[123] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-
grained categorization and verification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
[124] S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face
detection: A deep learning approach. In IEEE International Conference on Com-
puter Vision, pages 3676–3684, 2015.
[125] S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A face detection benchmark.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–
5533, 2016.
[126] X. Yang, X. He, X. Wang, Y. Ma, F. Feng, M. Wang, and T.-S. Chua. Interpretable
fashion matching with rich attributes. In Proceedings of the 42nd International
ACM SIGIR Conference on Research and Development in Information Retrieval,
2019.
[127] X. Yang, Y. Ma, L. Liao, M. Wang, and T.-S. Chua. Transnfcm: translation-based
neural fashion compatibility modeling. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 33, pages 403–410, 2019.
[128] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for
image question answering. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 21–29, 2016.
[129] R. Yin, K. Li, J. Lu, and G. Zhang. Enhancing fashion recommendation with
visual compatibility relationship. In The World Wide Web Conference, pages
3434–3440. ACM, 2019.
[130] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic
attention. In Proceedings of the IEEE Conference on computer vision and pattern
recognition, 2016.
[131] L.-F. Yu, S. K. Yeung, D. Terzopoulos, and T. F. Chan. Dressup!: outfit synthesis
through automatic optimization. ACM Trans. Graph., 31(6):134–1, 2012.
[132] H. Zhang, X. Wang, J. Zhu, and C.-C. J. Kuo. Accelerating proposal generation
network for fast face detection on mobile devices. In 2018 25th IEEE Interna-
tional Conference on Image Processing (ICIP), pages 326–330. IEEE, 2018.
[133] J. Zhang, X. Wang, D. Li, and Y. Wang. Dynamically hierarchy revolution: dirnet
for compressing recurrent neural network on mobile devices. IJCAI, 2018.
[134] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment
using multitask cascaded convolutional networks. IEEE Signal Processing Let-
ters, 23(10):1499–1503, 2016.
[135] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adver-
sarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recog-
nition, 2017.
[136] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark local-
ization in the wild. In IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 2879–2886. IEEE, 2012.
Abstract
Deep learning techniques use networks of multiple cascaded layers to map inputs to desired outputs. To perform this mapping, useful information must be extracted through the layers, and feature extraction and prediction are carried out jointly. We have no direct control over the feature extraction, so some useful information, especially local information, is discarded in the process.

In this thesis, we study local-aware deep learning techniques from four aspects:

1. Local-aware network architecture
2. Local-aware proposal generation
3. Local-aware region analysis
4. Local-aware supervision

Specifically, we design a multi-modal attention mechanism for a generative visual dialogue system in Chapter 2. A visual dialogue system holds a dialogue between a human and a machine. A generative visual dialogue system takes an image, a sentence from the current round of dialogue, and the dialogue from past rounds as inputs, and generates a response to continue the dialogue. Our proposed local-aware network architecture simultaneously attends to these multi-modal inputs and uses the extracted local information to generate dialogue responses.

We propose a proposal network for fast face detection on mobile devices in Chapter 3. A face detection system on mobile devices must meet several demands, including high accuracy, fast inference and small model size, because of the limited computation power and storage of such devices. Our proposed local-aware proposal generation module detects salient facial parts and uses them as local cues for detecting entire faces. It accelerates inference without adding much burden to the model size.

We extract representative fashion features by analyzing local regions in Chapter 4. Many fashion attributes, such as the shape of the collar, the length of the sleeves or the pattern of the prints, can only be found in local regions. Our proposed local-aware region analysis extracts representative fashion features from different levels of the deep network, so that the extracted features retain the local fashion details of human interest.

We develop a fashion outfit compatibility learning method with local graphs in Chapter 5. When a fashion outfit is modeled as a graph, a network that learns compatibility only from entire outfit graphs may ignore subtle differences among outfits. Our proposed local-aware supervision consists of the construction of local graphs and a corresponding local loss function. The local graphs are constructed from partial outfits, and the network trained with the local loss on these local graphs learns the subtle differences in compatibility among fashion outfits.
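The following Python sketch only illustrates the flavor of the local-aware supervision summarized above; it is not the network or loss developed in Chapter 5. The scorer CompatibilityNet, the mean pooling used in place of graph message passing, the exhaustive enumeration of partial outfits, and the weight lambda_local are all assumptions made for illustration.

# Illustrative sketch only, not the implementation described in Chapter 5.
# It shows the general idea of local-aware supervision for outfit compatibility:
# a global loss on the full outfit plus local losses on partial-outfit subgraphs.
import itertools
import torch
import torch.nn as nn

class CompatibilityNet(nn.Module):
    """Toy scorer: pools the item embeddings of an outfit (sub)graph into one logit."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, item_feats):            # item_feats: (num_items, dim)
        pooled = item_feats.mean(dim=0)       # crude stand-in for graph message passing
        return self.mlp(pooled)               # shape (1,): compatibility logit

def outfit_loss(model, item_feats, label, local_size=3, lambda_local=0.5):
    """Global loss on the whole outfit plus an averaged loss on partial outfits."""
    bce = nn.BCEWithLogitsLoss()
    target = torch.tensor([float(label)])     # 1 = compatible outfit, 0 = incompatible
    global_loss = bce(model(item_feats), target)

    # Local graphs: subsets of `local_size` items drawn from the same outfit.
    # A practical version would sample a few subsets instead of enumerating all of them.
    local_losses = [bce(model(item_feats[list(idx)]), target)
                    for idx in itertools.combinations(range(item_feats.size(0)), local_size)]
    if not local_losses:                      # outfit smaller than local_size
        return global_loss
    return global_loss + lambda_local * torch.stack(local_losses).mean()

# Example: a 5-item outfit labeled compatible, with random stand-in item features.
if __name__ == "__main__":
    model = CompatibilityNet(dim=128)
    outfit = torch.randn(5, 128)
    loss = outfit_loss(model, outfit, label=1)
    loss.backward()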