LABELING COST REDUCTION TECHNIQUES FOR DEEP LEARNING:
METHODOLOGIES AND APPLICATIONS
by
Yeji Shen
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2021
Copyright 2021 Yeji Shen
Acknowledgments
First of all, I would like to express my deep gratitude to my Ph.D. advisor, Prof. C.-C. Jay Kuo. Prof. Kuo is my role model not only in conducting research but also in his attitude toward life. His diligence, self-discipline, sense of responsibility and endless enthusiasm for research are probably something that I may not be able to achieve in my life. It is my honor to have Prof. Kuo as my advisor. I hope that I can come closer to Prof. Kuo's example in the future.
Next, I would like to thank my parents, Jinxia Shen and Qiuhong Zhao. I cannot imagine what my life would be like without their unconditional support over the past years. Our weekly video chats witnessed my growth as well as their aging. I am proud to be their son, and I truly hope that they are also proud of me.
Then, I would like to thank my girlfriend, Jin Zhou. Jin is almost always the first person I turn to when I have anything new that I would like to share. She comforted me when I was upset. She encouraged me when I failed. I could not have achieved what I have now without her.
During my Ph.D. study, many people helped me in both research and daily life: my labmates, Yuhang Song, Ye Wang, Haiqiang Wang, Kaitai Zhang, Bin Wang, Siyang Li, Junting Zhang, Heming Zhang and Chihao Wu; my roommates Jianfeng Wang, Ce Yang and He Jiang; my intern managers and colleagues at Google: Brian Potetz, Jianing Wei, Shahab Kamali, Pu-Chin Chen, Adel Ahmadyan, Tingbo Hou and Hanhan Li; as well as Alvin Kerber and Qin Huang (who is also my labmate) at Facebook. It has been my great pleasure to work with them.
Besides, I would like to thank Dr. Michael Shindler, Dr. Aaron Cote, Dr. Sandra Batista, Prof. Andreas Molisch and Dr. Arash Saifhashemi for the valuable TA experiences with them, and Prof. David Kempe, Justin Cheng, Sitan Gao, Changyu Zhu, Chenhui Zhu and Pengda Xiang for the happy times with the programming contests. I wish them all the best in their future lives.
Finally, I would like to thank my qualifying and defense committee members: Prof.
Antonio Ortega, Prof. Aiichiro Nakano, Prof. Mahdi Soltanolkotabi and Prof. Keith Jenkins. Their comments and suggestions were very helpful in my research.
Contents
Acknowledgments ii
List of Tables vi
List of Figures viii
Abstract xii
1 Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 K-covers for Active Learning in Image Classification . . . . . . 3
1.2.2 TBAL: Two-stage Batch-Mode Active Learning . . . . . . . . . 4
1.2.3 Weakly Supervised Multi-Person 3D Human Pose Estimation . 5
1.2.4 Self-supervised Representation Learning of Images with Texts . 6
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . 6
2 Research Background 8
2.1 Pool-based Active Learning . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Semi-supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Image Classification Datasets . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Human Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Self-supervised Representation Learning . . . . . . . . . . . . . . . . . 18
3 Active Learning: Balancing Uncertainty and Diversity 20
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Uncertainty Metrics . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 K-covers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 TBAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Analysis of Batch Redundancy Problem . . . . . . . . . . . . . 35
3.3.2 Experimental Setup of K-covers . . . . . . . . . . . . . . . . . 36
3.3.3 Experimental Results of K-covers . . . . . . . . . . . . . . . . 37
3.3.4 Experimental Setup of TBAL . . . . . . . . . . . . . . . . . . 41
3.3.5 Experimental Results of TBAL . . . . . . . . . . . . . . . . . . 44
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Weakly Supervised Human Pose Estimation 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.1 Multi-View Matching Method . . . . . . . . . . . . . . . . . . 57
4.2.2 Multi-Person 3D Pose Estimation from Single Image . . . . . . 61
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.1 MVM Evaluation with Mannequin Dataset . . . . . . . . . . . 67
4.3.2 3DPW Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.3 MSCOCO Evaluation . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Self-supervised Representation Learning 72
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 Input Signals and Downstream Baseline . . . . . . . . . . . . . 75
5.2.2 Self-Supervised Representation (SSR) . . . . . . . . . . . . . . 78
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.2 Image Tag Prediction . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.3 Functional Image Classification . . . . . . . . . . . . . . . . . 82
5.3.4 Implementation Choices . . . . . . . . . . . . . . . . . . . . . 85
5.3.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6 Conclusion and Future Work 87
6.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.1 Quadratic Programming Based Active Learning . . . . . . . . . 89
6.2.2 Weak 3D Information Fusion on Frozen Videos . . . . . . . . . 91
6.2.3 General Purpose Text Supervised Representation Learning . . . 92
Bibliography 94
List of Tables
3.1 Settings used for the evaluation of pool based active learning. . . . . . . 37
3.2 Entropy and its corresponding label distribution that is used in the class-
unbalanced initialization experiments. . . . . . . . . . . . . . . . . . . 38
3.3 Summary of experimental settings, including datasets, CNN models and
hyper-parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1 Performance comparison between several matching algorithms, where * means only evaluated on clips with fewer than 50 frames (otherwise, it would take too much time to run Dong’s algorithm). A smaller reconstruction error $E_R$ implies more consistent 3D poses. A larger number of corresponding 2D keypoints $|G_k|$ as well as a smaller number of outliers means a more robust 3D reconstruction process and thus potentially better 3D poses. . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Comparison among different affinity matrices. . . . . . . . . . . . . . . 68
4.3 Evaluation of estimated 3D poses for the MPII-3DPW dataset. The
upper part indicates the results of bottom-up based methods and the
lower part indicates the results of top-down based methods . . . . . . . 70
4.4 Evaluation of estimated 2D poses in the COCO validation set. . . . . . 70
4.5 Comparison between two 2D pose estimation networks, where we report the variance $C_{var}$ of the pairwise triangulated 3D poses, which is used to measure the consistency of predicted 2D poses, and the threshold used to eliminate outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Comparison of image tag prediction results, where “V”, “T” and “S”
stand for visual signals, OCR signals and SSR signals obtained by SSRIT,
respectively. Two kinds of pre-computed visual signals are tested. They
are ResNet-50 and BiT-M (BiT). The best and second-best results are
shown in bold and with an underline, respectively. . . . . . . . . . . . 83
5.2 Functional image classification results. “V”, “T” and “S” stand for
visual signals, textual signals and SSR signals, respectively. BiT is used
as visual signals. The best and second-best results are shown in bold
and with an underline, respectively. . . . . . . . . . . . . . . . . . . . . 84
5.3 FunctionalFlickr results with different representation network architectures. 85
5.4 FunctionalFlickr results on different extraction choices of the represen-
tation features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 ImageNet-1k classification performance. . . . . . . . . . . . . . . . . . 86
List of Figures
1.1 Examples from ImageNet [96], a large-scale image classification dataset
with more than 1.3M images. . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Examples of expensive unit labeling cost. The first row indicates fine-
grained image classification where the annotators need to differentiate
images of bird species. The second row indicates medical image classi-
fication where the annotators need to tell if a tumor is benign or malig-
nant. The third row indicates human pose estimation where the annota-
tors need to label the skeleton of a person in an image. . . . . . . . . . 2
2.1 An illustration of pool-based active learning. The final labeled set is
queried in multiple rounds. In each round of data acquisition, a batch
of unlabeled samples is selected and merged into the labeled set after getting
them labeled. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 An example of the model used in active learning. The feature function
and the distribution function are the two key components a model can
offer to the active learning strategy. In this example, the backbone net-
work is VGG [103]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 An illustration of the general idea of uncertainty sampling. Uncertainty
based AL methods will choose samples with the largest uncertainty val-
ues, which can be interpreted as selecting samples that are closest to the
decision boundary of some sort. . . . . . . . . . . . . . . . . . . . . . 10
2.4 An illustration of the general idea of diversity sampling. The core idea
of diversity sampling is to select a batch of samples that can somewhat
represent the whole dataset. . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Structure of two common semi-supervised learning methods, the Π-model
and temporal ensembling. The picture is from [61]. . . . . . . . . . . . 12
2.6 Examples of images in CIFAR-10 dataset [60]. There are ten classes
of images in this dataset. The ten classes are mainly about animals and
common objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 Examples of the CUB-200-2011 fine-grained image classification dataset
[115]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Examples of the SVHN digits recognition dataset [84]. . . . . . . . . . 15
2.9 Architecture of PifPaf [59], a state-of-the-art bottom-up multi-person
2D human pose estimation network. The overall framework has two
key components, an encoder network consisting of a ResNet based fea-
ture extractor and their proposed PIF and PAF modules, followed by a
decoder network that can convert predicted map representation of 2D
keypoints to vector representation. . . . . . . . . . . . . . . . . . . . . 17
3.1 An illustration of the problem of existing methods. Selecting samples
based on an uncertainty method or a diversity method alone may lead
to a batch that is not informative enough. . . . . . . . . . . . . . . . . . 21
3.2 Visualization of the selected batch sampled by different uncertainty met-
rics. Both tested uncertainty metrics have the redundancy problem. . . . 36
3.3 Visualization of the selected batch sampled by k-covers and uncertainty
sampling. K-covers can select a more balanced batch of unlabeled data,
which are also of high uncertainty (near the decision boundary of two
or more labels). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Active learning curves. We can observe a significant drop of perfor-
mance for all uncertainty methods (solid red/green/blue line), when the
initial distribution is unbalanced, while our method (i.e., the dotted lines) can still perform well. Our method also outperforms the two geometry methods (solid purple/orange line) by a large margin. . . . . . . . . . . 39
3.5 Distribution produced by different data acquisition functions during the
active learning process. X-axis indicates the steps. Y-axis indicates the
distribution, where different colors mean different labels (0 to 9 from bottom to top). The labels are based on the ground truth. These
distributions are drawn from a random trial of active learning in MNIST
when the initial distribution is extremely unbalanced (entropy=0.8). . . 40
3.6 The protocol of evaluating query strategies in supervised and semi-supervised
settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 Comparison of accuracy performance curves for three datasets: (a) SVHN,
(b) CIFAR-10, and (c) CUB-200-2011. The two rows of figures rep-
resent semi-supervised accuracy and transferred supervised accuracy,
respectively. The performance curves of a random baseline, the diversity-
only baseline, the uncertainty-only baseline, a simple combination method
and our CRB-based method are shown in blue, orange, green, purple and
red colors, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.8 Comparison of the mean performance gain of different versions of the
proposed TBAL method on SVHN, CIFAR-10 and CUB in terms of
semi-supervised accuracy (left) and transferred supervised accuracy (right).
The TBAL method with MCVarr and CRB (red bars) achieves the best
performance in both settings. . . . . . . . . . . . . . . . . . . . . . . . 46
3.9 Performance comparison between semi-supervised training methods with
different uncertainty functions, where the three rows stand for three
uncertainty functions, three columns indicate three datasets, and solid
and dotted curves represent CEAL-based and TBAL-based semi-supervised
training methods, respectively. . . . . . . . . . . . . . . . . . . . . . . 47
3.10 Comparison of averaged accuracy based on different uncertainty func-
tions, where individual performance curves are shown in Fig. 3.9. . . . 48
3.11 Performance comparison of TBAL with two state-of-the-art methods, VAAL and LL4AL, and the random selection baseline on the SVHN, CIFAR-10 and CUB-200-2011 datasets. . . . . . . . . . . . . . . . . . . 49
3.12 Comparison of the averaged performance gain for VAAL, LL4AL and
TBAL, where semi-supervised accuracy and transferred supervised accu-
racy are shown in the left and the right subfigures, respectively. . . . . . 50
4.1 Exemplary video clips from the Mannequin dataset, where action-frozen
people keep certain poses in diverse environments such as cafe, school,
living room, hall, etc. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Visualization of estimated camera poses: exemplary image frames in
two video clips (left) and the corresponding estimated camera pose tra-
jectory curves in red (right). . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Illustration of the generation of multi-person 3D poses by the proposed
MVM method. The input contains a sequence of image frames (shown
in the first block), 2D pose estimation at each frame (shown in the sec-
ond block), building correspondence of 2D poses across multiple frames
(shown in the third block), recovering 3D poses from matched 2D poses
(shown in the fourth block). . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 The proposed CNN for multi-person 3D pose estimation from a single
image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 The skeleton representation of human poses in our framework. . . . . . 63
4.6 Visualization of the Mannequin dataset results: (1st row) original images
with 2D poses, (2nd row) generated 3D poses. The first column indi-
cates a successful estimation of 3D poses in the scene. Due to the low
confidence score (e.g., the orange girl in the third column) or a small
parallax angle which can cause the failure of 3D reconstruction (e.g.,
some part of the joints associated with the orange-and-pink person in
the second column), the MVM method filters out such instances in a
scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Example images of the two tasks: image tag prediction (first row) and
functional image classification (second row). Underlined tags in the first
row can be also found in OCR results. . . . . . . . . . . . . . . . . . . 73
5.2 Our framework is based on the assumption that images that share the
same detected textual keywords (e.g. images with word “menu”) are
semantically similar. Thus, we can train a representation model that
learns image contents to capture information like “what images are likely
to have word ‘menu’ on it” and use such information to improve down-
stream models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 The downstream baseline for image classification with both visual and
textual inputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Self-supervised representation training with textual input. . . . . . . . . 79
5.5 Training and evaluation of an improved downstream pipeline using SSR
features as additional input. . . . . . . . . . . . . . . . . . . . . . . . . 80
5.6 Examples of images with high OCR scores in three rows: calendar,
menu and passport. The boxed images in red color indicate mislead-
ing images with a detected OCR keyword that are not semantically related. 81
5.7 Comparison of the image tag prediction performance between models
with different usage of textual signals, where “V”, “T” and “S” stand for
visual signals, textual signals and SSR signals, respectively. . . . . . . 83
5.8 Comparison of the image tag prediction performance between models
without direct usage of textual signals. We may assume that our trained SSR signals capture most of the “useful” information in the original visual signals, based on the observation that the orange bar (SSR signals only) is better than the green bar (combining SSR signals and visual signals). . . 84
6.1 Solving MIQP with cutting plane method. It can converge in 10 iterations. 90
6.2 Active learning curve of our IQP formulation. . . . . . . . . . . . . . . 91
6.3 Rich information provided by 3D reconstruction software COLMAP.
Currently, only the estimated camera poses are utilized, which is poten-
tially a waste of resources. . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4 Captioned images from MSCOCO [112] (left) and TextVQA [104] (right). 93
Abstract
Deep learning has contributed to a significant performance boost of many computer
vision tasks. Still, the success of most existing deep learning techniques relies on a large amount of labeled data. Since data labeling is costly, a natural question arises: is it possible to achieve better performance with the same data labeling budget? We provide two directions to address the problem: more efficient utilization of the budget
or supplementing unlabeled data with no labeling cost. Specifically, in this dissertation,
we study three problems related to the topic of reducing the labeling cost: 1) active
learning that aims at identifying most informative unlabeled samples for labeling; 2)
weakly supervised 3D human pose estimation that utilizes a special type of unlabeled
data, action-frozen people videos, to help improve the performance with few manual
annotations; and 3) self-supervised representation learning on a large-scale dataset of
images with text and user-input tags at no additional labeling cost.
Active learning is a technique that deals with abundant unlabeled data with limited
labeling resources. Instead of labeling the entire dataset, active learning methods count
on a certain query strategy that chooses a specific set of samples from the unlabeled
data pool and then ask annotators to label them. One common type of query strategies
is to choose samples with the largest uncertainty. Such uncertainty based methods may
suffer from the redundancy of the selected batch. In order to overcome the batch redun-
dancy problem, we propose two methods, k-covers and TBAL to address the problem.
Both proposed methods rely on the assumption that combining uncertainty and diversity can improve performance. Extensive experiments on public datasets show that our proposed methods outperform existing active learning methods by a clear margin.
The second research problem, human pose estimation, is a long-standing problem in computer vision research. We mainly focus on the multi-person 3D human pose estimation case. The main challenge of multi-person 3D human pose estimation is the lack of labeled data. The acquisition of 3D labels is hindered by inadequate tools to
obtain reliable 3D information for monocular images. We propose Multi-View Matching (MVM), a method to obtain high-quality 3D skeletons from multiple 2D poses obtained in each frame of action-frozen people videos, which can be downloaded from YouTube at essentially no cost. With the help of the collected YouTube videos, we use the MVM method to generate weak 3D supervisions to train a multi-person 3D human pose estimation neural network. Experiments show that our proposed solution offers state-of-the-art performance on the 3DPW and the MSCOCO datasets.
In the third part, we focus on the utilization of textual information in images. Text
information inside images could provide valuable cues for image understanding. We
propose a simple but effective representation learning framework, called the Self-Supervised
Representation learning of Images with Texts (SSRIT). SSRIT exploits optical character recognition (OCR) signals in a self-supervised manner. SSRIT constructs a represen-
tation that is trained to predict whether the text in the image contains particular words
or phrases. This allows us to leverage unlabeled data to uncover the non-textual visual
features shared by images that contain similar text. The SSRIT representation is bene-
ficial to image tag prediction and functional image classification. In both tasks, SSRIT
outperforms baseline models with no OCR information as well as models that consume
OCR with no self-supervised representation.
Chapter 1
Introduction
1.1 Significance of the Research
Deep learning focuses on machine learning models with hierarchical architectures to
learn and represent high-level abstractions from data. Even though in the recent context of computer vision papers the term deep learning basically refers to convolutional neural networks (CNNs), there is actually a wide range of computational models in the field of deep learning, including neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms [114]. Three important reasons contribute to the recent surge of interest in deep learning: drastically improved hardware processing ability, considerable advances in machine learning algorithms, and the abundance of various kinds of data from different sources, e.g., images, videos, social graphs, etc. [32].
Figure 1.1: Examples from ImageNet [96], a large-scale image classification dataset
with more than 1.3M images.
Due to their considerable representation capacity, deep learning models benefit from large-scale datasets. Many recent computer vision methods initialize their models with ImageNet pre-trained parameters. On the other hand, a large amount of labeled data is required in network training for a deep learning model to perform well, and collecting such data demands effort. The recent development of the Internet offers a much easier way of collecting unlabeled data. For example, it is not hard to collect a huge number of images through search engines or web crawlers. However, it is much more expensive to annotate the collected images for targeted applications.
Figure 1.2: Examples of expensive unit labeling cost. The first row indicates fine-
grained image classification where the annotators need to differentiate images of bird
species. The second row indicates medical image classification where the annotators
need to tell if a tumor is benign or malignant. The third row indicates human pose
estimation where the annotators need to label the skeleton of a person in an image.
Some applications in computer vision require labels that are hard to obtain. As shown in Figure 1.2, the three rows of examples show that the labeling cost can be quite high when the target to label is hard to differentiate, when some domain knowledge is required, or when there are too many points to annotate in a single image, respectively. Besides tasks that have a high unit cost, the labeling cost is inevitably expensive when a large amount of data needs to be labeled. According to Amazon Mechanical Turk, a service that crowd-sources labor-intensive jobs such as labeling tasks, it may cost more than 14,580 USD to label the sentiment of a dataset of 85,000 articles.
It is a challenging task to design algorithms and models that reduce the labeling cost in deep learning scenarios. In general, there are two main directions for reducing the labeling cost while allowing a deep learning model to perform well: selecting more informative samples to get labeled, or making use of unlabeled data that costs basically nothing. The former is the main objective of active learning and the latter is the goal of weakly-supervised / semi-supervised methods.
In this dissertation, we first focus on improving the performance of active learning by
investigating methods to select more informative samples. Two methods are proposed
to achieve that objective. Then we switch our goal to the exploration of techniques to leverage a special kind of unlabeled data: action-frozen videos. Specifically, we
propose to use the weak 3D supervision generated from multiple frames of the action-
frozen people videos for the training of multi-person 3D human pose estimation. Finally,
we design a self-supervised model to exploit text information in images. Our proposed
representation network can be trained on a large-scale dataset with no additional labeling
cost with the help of user-input tags associated with each image.
1.2 Contributions of the Research
1.2.1 K-covers for Active Learning in Image Classification
Active learning can offer useful techniques to reduce the labeling cost. Batch redun-
dancy is a crucial problem in active learning. We propose k-covers, a clustering based
query strategy, to address the redundancy problem.
• Through the visualization results, we verify the batch redundancy problem for
uncertainty based active learning strategies.
• Experiments show that our method can outperform the uncertainty based baseline
query strategy and two other diversity based query strategies by a clear margin on
two public datasets: MNIST and CIFAR-10. More specifically, uncertainty based
methods do not work well when the initial set is not balanced, while our method
is still robust.
• Our method can be used as a post-processing module and combined with any
uncertainty based query strategy. In the experiments, we show that our method successfully improves the overall accuracy when combined with three common uncertainty metrics.
1.2.2 TBAL: Two-stage Batch-Mode Active Learning
Similar to the general goal of active learning as introduced in Sec. 1.2.1, our proposed
TBAL aims at improving the informativeness of the selected batch. The main objec-
tive of TBAL is to overcome the two major drawbacks of k-covers: a large number of unlabeled samples being ignored, and degraded behavior when the batch size is too small.
• A two-stage batch-mode active learning (TBAL) method that combines semi-
supervised learning and batch mode sampling is proposed to achieve consistently
better performance in experimental evaluations on the SVHN, CIFAR-10 and
CUB-200-2011 datasets.
• A new active-learning evaluation protocol is proposed. It can measure the perfor-
mance of the query strategies of different active learning methods in both semi-supervised and fully-supervised settings.
• We develop a novel and effective sampling strategy that balances uncertainty and
diversity.
1.2.3 Weakly Supervised Multi-Person 3D Human Pose Estimation
The major reason that stops 3D human pose estimation from being practical is the lack
of large-scale in-the-wild datasets. To tackle the challenging problem of obtaining 3D pose labels in natural unconstrained scenes, we propose a multi-view matching (MVM)
method to generate reliable 3D human poses from a large-scale video dataset, called the
Mannequin dataset, that contains action-frozen people imitating mannequins. With a
large amount of in-the-wild video data labeled by 3D supervisions automatically gen-
erated by MVM, we are able to train a neural network that takes a single image as the
input for multi-person 3D pose estimation.
• We develop an efficient method, called the MVM method, that leverages geomet-
ric constraints and appearance similarities existing in video clips of static scenes for reliable 3D human pose estimation.
• We collect a large number of YouTube video clips containing action-frozen people in in-the-wild scenes from the Mannequin Challenge dataset [67] and build a new
Mannequin dataset. It can be used as a training dataset for multi-person 3D pose
estimation.
• Although there is no ground truth, we use the MVM method to generate weak 3D
supervisions for the collected videos so that our Mannequin dataset can be used
to train a multi-person 3D neural network that is applied to a single image.
• We use the Mannequin dataset as the training dataset and show the effectiveness
of the weak 3D supervisions provided by MVM through extensive experimental
results conducted for the 3DPW dataset and the 2D MSCOCO dataset.
1.2.4 Self-supervised Representation Learning of Images with Texts
Text information in images can be efficiently extracted by optical character recognition
(OCR) techniques. Despite the fact that there have been a considerable number of papers
on OCR engines, there is less work on leveraging the extracted text data for image under-
standing. In our work, we aim at exploiting the extracted text information to improve the
accuracy of two image understanding tasks: 1) image tag prediction and 2) functional
image classification.
• We propose a simple yet effective representation learning framework, called
SSRIT, whose training benefits from the use of textual data that are naturally associated with the training images.
• We obtain consistent performance improvements with the SSRIT model regardless of whether the textual signals are present in test images or not.
• Our proposed model is built on top of pre-computed visual signals, which can
significantly reduce the computation resources needed during the training stage of
the model.
1.3 Organization of the Dissertation
The rest of the dissertation is organized as follows. In Chapter 2, we review the research
background, including active learning, semi-supervised learning, image classification
datasets and human pose estimation. In Chapter 3, we propose two methods, k-covers
and TBAL, to explore the techniques of balancing diversity and uncertainty in active
learning. In Chapter 4, we propose MVM, a framework to generate weak labels for the
multi-person 3D human pose estimation network. In Chapter 5, we propose SSRIT, a
model to distill extracted text information in a representation network which can be used
to facilitate downstream classifiers. Finally, concluding remarks and future research
directions are given in Chapter 6.
Chapter 2
Research Background
2.1 Pool-based Active Learning
Figure 2.1: An illustration of pool-based active learning. The final labeled set is queried
in multiple rounds. In each round of data acquisition, a batch of unlabeled samples is
selected and merged into the labeled set after getting them labeled.
The pool-based framework is the most commonly used paradigm for the active learning problem. An illustration is shown in Figure 2.1. In pool-based active learning, the final labeled set is constructed in an iterative manner. In each round of data acquisition, a batch of unlabeled samples is selected based on the model trained with the current labeled set as well as, optionally, the unlabeled set. The selected batch is merged into the labeled set after querying the oracle to get it labeled.
The deep learning based model in pool-based active learning can usually provide two key components: the feature function $F(x)$ and the distribution function $p(y = c \mid x)$. As shown in Figure 2.2, which illustrates the architecture of the VGG network [103], the feature function is represented by the output of the last fully connected layer before the softmax, and the distribution function is estimated by the softmax output of the network. The choice of which layer's response serves as the feature function may vary across network architectures, but this choice is agnostic to the subsequent decisions of the active learning strategy.
Figure 2.2: An example of the model used in active learning. The feature function and
the distribution function are the two key components a model can offer to the active
learning strategy. In this example, the backbone network is VGG [103].
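To make the two components concrete, the following is a minimal PyTorch sketch (illustrative code, not taken from this dissertation) that assumes a torchvision VGG-16 backbone: the activations of the last fully connected layer before the softmax play the role of the feature function $F(x)$, and the softmax output plays the role of the distribution function $p(y = c \mid x)$.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Illustrative VGG-16 backbone with 10 output classes (assumption: torchvision's layout).
model = models.vgg16(num_classes=10)
model.eval()

def feature_and_distribution(x):
    """Return (F(x), p(y=c|x)) for a batch of images x."""
    with torch.no_grad():
        feats = torch.flatten(model.avgpool(model.features(x)), 1)
        penultimate = model.classifier[:-1](feats)                    # feature function F(x)
        probs = F.softmax(model.classifier[-1](penultimate), dim=1)   # distribution p(y = c | x)
    return penultimate, probs

features, probs = feature_and_distribution(torch.randn(4, 3, 224, 224))
print(features.shape, probs.shape)   # e.g. torch.Size([4, 4096]) torch.Size([4, 10])
```

Both quantities come out of a single forward pass, so a query strategy built on them adds little overhead beyond scoring the unlabeled pool.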
Deep learning compatible AL strategies. The emergence of deep learning brings
new challenges to active learning. It was shown in [100] that some traditional methods
[49, 110, 9] do not scale well for CNNs. There are three major types of active learn-
ing approaches: uncertainty based, diversity based and combined uncertainty/diversity
based.
For uncertainty-based methods, Gal et al. [29] proposed to use the Bayesian frame-
work to estimate uncertainty by approximating the posterior distribution of the uncer-
tainty function using the averaged outputs with dropout [106]. Along this direction,
Beluch et al. [6] used an ensemble method to estimate the posterior distribution. How-
ever, the ensemble-based method may suffer from a lack of diversity in the selected samples
[81].
For diversity-based methods, Sener and Savarese [100] converted the active learning
problem to a core-set problem and showed that the estimation error is bounded by the
Figure 2.3: An illustration of the general idea of uncertainty sampling. Uncertainty
based AL methods will choose samples with the largest uncertainty values, which can
be interpreted as selecting samples that are closest to the decision boundary of some
sort.
radius of a core set. Nguyen and Smeulders [85] exploited the cluster information to
avoid repetitive sampling in the same cluster in binary classification.
For combined uncertainty/diversity-based methods, a hybrid framework that lever-
ages uncertainty and information density was proposed in [66]. Furthermore, Yang et
al. [125] combined the core-set concept and uncertainty for a biomedical image classi-
fication application. An integer programming formulation was proposed in [24, 117] to
balance uncertainty and diversity. The adversarial training idea was also incorporated in
active learning [105].
Batch-mode active learning. Batch-mode active learning selects a batch of instances to
label at each iteration instead of choosing one at a time. It is sometimes called the pool-
based active learning. The effectiveness of batch-mode active learning was theoretically
proven in [28].
Although quite a few recent researchers [6, 126, 29] conducted their experiments in
batch mode, their methods actually did not have a special mechanism for batch mode
sampling. The batch was formed by selecting those instances with the top estimated
Figure 2.4: An illustration of the general idea of diversity sampling. The core idea of
diversity sampling is to select a batch of samples that can somewhat represent the whole
dataset.
individual scores for computational efficiency. As compared with those papers, some
earlier work, e.g., [40, 31, 12, 102] studied how multiple instances in a batch interact
with each other and how to handle the potential redundancy problem within a batch.
Brinker [9] assumed that selected samples approximately bisect the feature space
into two halves of equal volume and took the diversity of the selected batch into account.
However, his framework does not apply to the multi-class case. To handle the multi-class
case, Guo [31] proposed to use the mutual information between labeled instances and
unlabeled instances to be selected to measure informativeness of a batch of samples.
Then, he converted the mutual information maximization problem to a matrix partition
problem. Finally, he used the Gaussian Process to find a feasible solution. Chattopad-
hyay et al. [12] estimated some sort of marginal probability and then selected batches
of query samples that represent best the distribution of unlabeled instances.
Yet, the above-mentioned three methods do not scale well when the pool size is large
or the dimension of the feature space is high. As a result, existing batch-mode active
learning methods cannot be efficiently applied to a CNN-based model, which is the main
interest of our current research.
2.2 Semi-supervised learning
How to use the information from unlabeled data is a core issue in many machine learning problems, including semi-supervised learning [133], domain adaptation [132], and unsupervised learning [71]. Rasmus et al. [94] proposed the use of a ladder network to train a decoder that produces identical results for unlabeled samples with noise injected in each layer of the encoder. Laine and Aila [61] proposed the Π-model and temporal ensembling to refine the consistency-based regularization idea, which demands that the network have consistent responses in the neighborhood of unlabeled samples. Luo et al. [74] extended the regularization idea by taking into account the latent graph formed by the pseudo labels of unlabeled samples.
Figure 2.5: Structure of two common semi-supervised learning methods, the Π-model and
temporal ensembling. The picture is from [61].
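As a rough illustration of the consistency-regularization idea behind the Π-model, the sketch below (illustrative only; the noise model and loss weighting are assumptions, not the exact formulation of [61]) combines the usual cross-entropy on labeled samples with a penalty on the disagreement between two stochastic forward passes of the same unlabeled samples.

```python
import torch
import torch.nn.functional as F

def pi_model_loss(model, x_labeled, y_labeled, x_unlabeled, w_consistency=1.0):
    """Supervised cross-entropy plus a Pi-model style consistency term."""
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

    # Two stochastic views of the same unlabeled batch (here: simple Gaussian input noise;
    # dropout inside the model adds further randomness between the two passes).
    noisy = lambda x: x + 0.05 * torch.randn_like(x)
    p1 = F.softmax(model(noisy(x_unlabeled)), dim=1)
    p2 = F.softmax(model(noisy(x_unlabeled)), dim=1)
    consistency = F.mse_loss(p1, p2)

    return sup_loss + w_consistency * consistency
```

In practice the consistency weight is usually ramped up over the first training epochs so that the unlabeled term does not dominate before the network produces meaningful predictions.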
There have been a number of existing works that try to combine the semi-supervised learning idea with active learning. Hoi et al. [41] proposed a min-max optimization-based method where both labeled data and unlabeled data were taken into account. Since their method was optimized for the SVM classifier, it does not apply to our situation. To develop a deep-learning-based active learning method, Wang et al. [117] incorporated a threshold-based semi-supervised learning idea and applied simple uncertainty sampling to the trained model. Sinha et al. [105] proposed to leverage the adversarial idea and allow their model to learn from both labeled and unlabeled data. Their strategy is to train a Variational Autoencoder (VAE), which learns from a process of discriminating unlabeled and labeled data, and cascade it with a discrimination network. Then, samples that have the smallest probability of being predicted as labeled data are selected for labeling, since they are the least representative of the current labeled set.
2.3 Image Classification Datasets
Figure 2.6: Examples of images in CIFAR-10 dataset [60]. There are ten classes of
images in this dataset. The ten classes are mainly about animals and common objects.
Large-scale annotated image datasets have been instrumental in driving the progress
of image classification in the last decade. CIFAR-10 is one of the most commonly used large-scale image classification datasets. Examples of images in the CIFAR-10 dataset can be found in Figure 2.6. There are 50,000 training images and 10,000 testing images in the CIFAR-10 dataset, and the resolution of the images is 32 by 32 pixels.
Figure 2.7: Examples of the CUB-200-2011 fine-grained image classification dataset
[115].
Unlike regular image classification datasets that contain a wide variety of classes
such as different kinds of animals and objects, fine-grained datasets aim at subordinate
category classification where the target object classes are usually very close and some-
times difficult to distinguish even for human beings. For example, CUB-200-2011 is a
dataset that contains 200 classes of bird species as shown in Fig. 2.7. We will use this
dataset to demonstrate the capability of our proposed active learning strategy.
Digit recognition is another special type of image classification task. Recognizing a digit is usually regarded as an easier problem compared to the classification of object classes like cats or dogs. In practice, much less data is needed to achieve a reasonable performance for the digit recognition task compared to regular image classification. SVHN is a large-scale digit recognition dataset which consists of 73,257 training and 26,032 testing images of house number plates on the street. Some examples can be found in Figure 2.8.
Figure 2.8: Examples of the SVHN digits recognition dataset [84].
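For reference, the image classification datasets discussed in this section are easy to obtain programmatically. The snippet below is a small illustrative example using torchvision (CUB-200-2011 is not bundled with torchvision and must be downloaded separately).

```python
from torchvision import datasets, transforms

tfm = transforms.ToTensor()

# CIFAR-10: 50,000 training / 10,000 testing images at 32x32 resolution.
cifar_train = datasets.CIFAR10(root="data", train=True, download=True, transform=tfm)
cifar_test = datasets.CIFAR10(root="data", train=False, download=True, transform=tfm)

# SVHN: 73,257 training / 26,032 testing digit images.
svhn_train = datasets.SVHN(root="data", split="train", download=True, transform=tfm)
svhn_test = datasets.SVHN(root="data", split="test", download=True, transform=tfm)

print(len(cifar_train), len(cifar_test))   # 50000 10000
print(len(svhn_train), len(svhn_test))     # 73257 26032
```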
2.4 Human Pose Estimation
3D Human Pose Estimation. There are two common representations for 3D human
poses: the skeleton-based representation and the parametric-model-based representa-
tion. The skeleton-based representation serves as the minimal representation of human
poses. It finds applications such as in sports motion analysis [87], among many oth-
ers. One approach to skeleton-based 3D poses estimation is the use of the 2D-to-3D
lifting mechanism. Martinez et al. [79] proposed a method that uses a cascade of
fully-connected (FC) layers to infer 3D skeletons from 2D skeletons. Recently, Ci et
al. [15] adopted a modified graph neural network to achieve the same goal. Yang et
15
al. [124] proposed a data augmentation method to synthesize virtual candidate poses,
which improves performance consistently.
Besides using RGB images as the input, the skeleton-based representation can also
be applied to range images for 3D human pose estimation. For example, Marin-Jimenez
et al. [77] examined a convolutional neural network (CNN) based method that esti-
mates 3D human poses from depth images directly. Zhang et al. [128] proposed a
clustering based method with hybrid features by integrating the geodesic distance and
the superpixel-based mid-level representation.
The parametric-body-model-based representation offers richer information of the
human body. Early work used the SCAPE model [2] that fits a body model to annotated
2D keypoints in the image. More recently, the SMPL model [72] has become popular.
Kanazawa et al. [52] proposed a method that fits the SMPL model by minimizing the
re-projection error with respect to the 2D keypoint annotations while taking the human
pose prior such as joint angles and shape parameters into account. Lassner et al. [63]
included the 2D silhouette projection in the loss function for data with segmentation
mask annotations (e.g., the MSCOCO dataset).
Although they are useful in several contexts, both representations cannot be used in
our multi-person 3D pose estimation framework directly. For the former, it is not easy
to generalize the skeleton model to handle a varying number of people in an image. For the latter, the parametric model is too heavy for multi-person 3D pose estimation since it has too many parameters. It is worthwhile to mention that there exists work that targets obtaining 3D human poses directly. For example, Papandreou et al. [88] showed that a coarse-to-fine volumetric representation is better than the simplest K-by-3
vector representation, where K is the number of keypoints of the skeleton. A similar
conclusion was observed in [108], which further proposed an integral regression on 3D
skeletons.
Figure 2.9: Architecture of PifPaf [59], a state-of-the-art bottom-up multi-person 2D
human pose estimation network. The overall framework has two key components, an
encoder network consisting of a ResNet based feature extractor and their proposed PIF
and PAF modules, followed by a decoder network that can convert predicted map repre-
sentation of 2D keypoints to vector representation.
Multi-person human pose estimation. To solve the problem of multi-person pose esti-
mation from a single image, we can categorize current methods into two types: the top-
down approach [88, 36, 111, 118, 86, 127, 123] and the bottom-up approach [59, 10].
The top-down approach runs a person detector and then estimates body joints in the
detected bounding box. The associated methods benefit from advances in person detec-
tors and a vast amount of labeled bounding boxes for people. The ability to leverage
the labeled data turns the requirement of a person detector into an advantage. Yet, when
multiple bounding boxes overlap, most single-person pose estimators do not work well.
Unfortunately, there are many such cases in the Mannequin dataset of our interest, so the top-down approach is not applicable. In contrast with the top-down approach, the bottom-up approach does not rely on a person detector. Kreiss et al. [59] first estimated each body joint and then grouped the joints to form a unique pose by a method called the
Parts Association Field (PAF). The architecture of PifPaf [59] is shown in Figure 2.9.
Human pose tracking. A single-frame multi-person pose estimator cannot ensure con-
sistency of identity across frames. Depending on how multiple frames are utilized, we
categorize pose tracking methods into two types: offline methods and online methods.
To address the identity correspondence problem across frames, many offline methods [44, 45, 120, 129] were proposed to enforce temporal consistency of poses in video clips. Since they usually demand the solution of some difficult-to-optimize formulation in the form of spatial-temporal graphs, their solution speed is slow.
For online methods, a common technique to handle the multi-person identification
problem across frames is to maintain temporal graphs in neural networks [20, 30, 99].
Girdhar et al. [30] proposed a 3D extension of Mask-RCNN, called person tubes, to connect people across time. Yet, its tracking result is no better than the simple Hungarian Algorithm baseline [45], even though more time and memory are needed to support the grouped convolution operations in the implementation for a couple of frames. Joint Flow [22] exploited the concept of the Temporal Flow Field to connect keypoints across two frames. However, the flow-based representation suffers from ambiguity when subjects move slowly (or are stationary), which demands special handling of such cases in the tracking process. Clearly, this method is not applicable to the Mannequin dataset that contains action-frozen people.
2.5 Self-supervised Representation Learning
Text-assisted visual representation learning. Using text information to train a visual representation model has a long history. Tracing back more than 20 years, Mori et al. [82] proposed to train an image retrieval model that predicts a set of words that are
paired with images. Other traditional machine learning models such as manifold learn-
ing [92] and Deep Boltzmann Machines [107] were also studied and found to be useful
in predicting words in image captions and/or user-defined tags. In recent years, text-
assisted visual representation learning for general image classification tasks under var-
ious settings (e.g., full finetune transfer, linear finetune transfer and zero-shot/few-shot
transfer) was investigated in [51, 65, 93]. CNN features were introduced by Joulin et al.
[51] for text-based visual representation learning, where CNNs were trained for multi-
label classification based on a set of words converted from titles, descriptions and hash-
tag metadata in the YFCC dataset. By following this direction, Li et al. [65] proposed
to use n-grams of words (instead of individual words) as the training target. Radford et
al. [93] incorporated contrastive learning, which minimizes the batch-based multi-class
N-pair loss for image and text embeddings, and showed excellent performance in a wide
range of vision and language tasks. More recent works have shown applications of text
information in visual question answering (VQA) [3, 73, 112, 80, 104].
Weak supervision for visual representation learning. Efforts to train CNNs using
non-manually-labeled signals have been explored in the past, e.g., [19, 21, 26, 121].
Chen and Gupta [13] proposed a curriculum-learning method that trains CNNs on easy
examples retrieved from Google Images and, then, finetuned it on weakly labeled image-
hashtag pairs. Izadinia et al. [46] finetuned a pretrained CNN using Flickr images and
a vocabulary of 5,000 words. Denton et al. [17] attempted to address the user-conditional
image and hashtag embedding problem based on a large collection of Instagram pho-
tos and hashtags. For data cleaning, prior studies [119, 25, 130] developed algorithms
to identify relevant image-hashtag pairs from a weakly labeled dataset by discarding
low-quality labels. Others [51, 47] conducted visual representation learning based on a
sufficiently large amount of weakly supervised training data (e.g., tags). More recently, self-supervised or unsupervised training techniques have been applied to representation learning models [122, 57, 23].
Chapter 3
Active Learning: Balancing
Uncertainty and Diversity
3.1 Introduction
Active learning is a technique that deals with abundant unlabeled data with limited label-
ing resources. Instead of labeling the entire dataset, active learning methods count on a
certain query strategy that chooses a specific set of samples from the unlabeled data pool
and then ask annotators to label them. This process is usually done iteratively. Given a
total budget on the number (or percentage) of labeled samples, the goal of active learning is to maximize the accuracy of the model trained on the selected set of labeled samples. Active learning plays a critical role in building large-scale datasets and has attracted a lot of attention in recent years.
In the context of this chapter, we will mainly focus on batch-mode active learning. The batch-mode active learning methodology [40, 33] was originally proposed to
reduce the number of query iterations for pool-based active learning. A typical setting
of early pool-based active learning methods is selecting only one unlabeled instance at a
time, while re-training the classifier at each iteration. When the training process is time
consuming, too many rounds of repeated re-training is inefficient. This is common in
the training of large CNNs. The batch mode that selects multiple unlabeled samples per
iteration is more suitable for CNN-based image classification tasks.
On one hand, sampling in larger batches can reduce the iteration number. On the
other hand, if a sampling strategy fails to exploit the correlation of data samples in one
batch per iteration, the batch-mode method may yield a less informative labeled set
under the total budget constraint. Some researchers [6, 126] select a batch of samples
that have top uncertainty scores. However, the relationship between selected samples
is not well considered. The effect of batch sampling in the setting of traditional image
classifiers such as the logistic regression and the support vector machine (SVM) has
been intensively studied in early work [41, 11, 5]. Yet, we see fewer activities [55] on
the development of uncertainty metrics that are compatible with batch-mode CNN-based
active learning.
Figure 3.1: An illustration of the problem of existing methods. Selecting samples based
on an uncertainty method or a diversity method alone may lead to a batch that is not
informative enough.
Applying an uncertainty based strategy or a diversity based strategy alone may not be sufficient, as depicted in Figure 3.1. Uncertainty based methods will choose samples that are closest to the decision boundary of some sort, and may therefore suffer from a group of outliers that happen to appear uncertain to the current model. Diversity based methods will choose samples that are representative of a local cluster of samples, but fail to take the uncertainty of the model into account.
In this chapter, we propose two methods that can jointly consider uncertainty and
diversity, k-covers and TBAL. The details of the proposed methods are described in
Sec. 3.2. The experimental results are shown and analyzed in Sec. 3.3. The conclusions
are discussed in Sec. 3.4.
3.2 Methodology
3.2.1 Uncertainty Metrics
Uncertainty-sampling-based query strategies assume that a larger performance gain will
be obtained by labeling samples of high uncertainty. The intuition is that it would be
beneficial to provide labels to those samples about which the current model is not sure. There
are a few metrics developed to measure whether an unlabeled sample is uncertain or not.
• Entropy [101]
The entropy is a commonly used metric to measure the uncertainty of a random variable. The entropy of an image sample, $x$, can be computed as
$$H_{\mathrm{Ent}}(x) = -\sum_{c=1}^{C} p(y=c \mid x) \log p(y=c \mid x), \qquad (3.1)$$
where $p(y=c \mid x)$ is the estimated probability of $x$ being classified to class $c$ in the softmax output of our model.
• Variation Ratio [29]
The variation ratio measures the level of confidence in prediction. It is defined as
$$H_{\mathrm{Varr}}(x) = 1 - \max_{c=1\ldots C} p(y=c \mid x). \qquad (3.2)$$
• Margin Sampling [117]
Margin sampling measures the difference between the probabilities of the most and the second most probable classes. It is defined as
$$H_{\mathrm{Margin}}(x) = 1 - \bigl( p(y=c_1 \mid x) - p(y=c_2 \mid x) \bigr), \qquad (3.3)$$
where $c_1$ and $c_2$ are the most and the second most probable classes, respectively.
• MC-Dropout [29]
The MC-dropout method estimates the Bayesian uncertainty function by approximating the prior distribution of the model parameters with the dropout layers in CNNs [106]. It is worthwhile to mention that the model parameter $\theta$ in a dropout network is a random variable rather than a deterministic value. The solution to the loss function optimization offers only one value of $\theta$, namely the one that maximizes the log-likelihood (the maximum likelihood estimate, MLE). By the law of total probability, the Bayesian estimate of the posterior distribution $p(y \mid x)$ can be written as
$$p(y \mid x) = \int p(y \mid x, \theta)\, p(\theta)\, d\theta, \qquad (3.4)$$
where $p(\theta)$ is the prior distribution of $\theta$. Since there is no explicit form of $p(\theta)$, we use the Monte-Carlo method to approximate the integration in Eq. (3.4):
$$\hat{p}(y \mid x) = \frac{1}{t} \sum_{j=1}^{t} \hat{p}(y \mid x, \theta_j), \qquad (3.5)$$
where $\hat{p}$ denotes the output of the network with parameter $\theta_j$, and where $\theta_j$, $j = 1, \ldots, t$, are $t$ observed values of the random variable $\theta$ under the distribution function $p(\theta)$. Under the assumption that the distribution of the random variable $\theta$ can be implicitly characterized by the randomized dropout of CNNs, Gal et al. [29] obtained the Monte-Carlo approximation in Eq. (3.5) by letting $\hat{p}(y \mid x, \theta_j)$ be the $j$-th forward output of the network with randomized dropout.
In the context of this chapter, we use the variation ratio as the basic uncertainty function. Then, by following [29], we compute the Monte-Carlo Variation Ratio in the form of
$$H_{\mathrm{MCVarr}}(x) = 1 - \max_{c=1\ldots C} \hat{p}(y=c \mid x) = 1 - \max_{c=1\ldots C} \frac{1}{t} \sum_{j=1}^{t} p(y=c \mid x, \theta_j), \qquad (3.6)$$
where $p(y=c \mid x, \theta_j)$ denotes the feedforward output of the network with randomized dropout in the $j$-th computation.
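The metrics above translate directly into code once the softmax outputs are available. The sketch below is an illustrative PyTorch implementation (not the dissertation's code); for Eq. (3.6), `model` is assumed to contain dropout layers, and keeping it in training mode during inference provides the $t$ stochastic forward passes.

```python
import torch
import torch.nn.functional as F

def entropy(probs):                       # Eq. (3.1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

def variation_ratio(probs):               # Eq. (3.2)
    return 1.0 - probs.max(dim=1).values

def margin(probs):                        # Eq. (3.3)
    top2 = probs.topk(2, dim=1).values
    return 1.0 - (top2[:, 0] - top2[:, 1])

def mc_variation_ratio(model, x, t=20):   # Eq. (3.6)
    """Average t stochastic forward passes (dropout kept active),
    then take the variation ratio of the averaged distribution."""
    model.train()                         # keep dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=1) for _ in range(t)]
        ).mean(dim=0)
    return variation_ratio(probs)
```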
3.2.2 K-covers
Our method tries to find the data sample with the largest uncertainty in each cluster, where the clusters are determined by our proposed k-covers clustering approach. The intuition behind
our proposed method is that we want our chosen samples to be evenly distributed in the
feature space. For those samples that are partitioned into the same cluster, which means
that they are visually similar, we only need to choose one representative from them to
avoid potential redundant labeling efforts.
System Overview
In pool based active learning, we only consider a finite set of data samples S that are
drawn from the real data distribution D. Among S, we have an initially labeled set
L^{(0)}, which is usually small, as well as the unlabeled set U^{(0)}
(S = U^{(0)} ∪ L^{(0)} and U^{(0)} ∩ L^{(0)} = ∅).
We assume that there are b rounds of data acquisition until we run out of
the budget. In each step i, we train a model M^{(i)} on L^{(i)}. Then we use an acquisition
function a(U, M) to determine which subset T^{(i)} ⊆ U^{(i)} gets labeled in this step. After
that, we update U and L accordingly:
T^{(i)} = a(U^{(i)}, M^{(i)}),
U^{(i+1)} = U^{(i)} \setminus T^{(i)},
L^{(i+1)} = L^{(i)} \cup T^{(i)},  (3.7)
and the objective of active learning is to find the best a that maximizes the accuracy of
the final model M^{(b)}.
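The acquisition loop of Eq. (3.7) then takes the following generic form; train_model, acquire and label_samples are hypothetical placeholders for the model training routine, the acquisition function a(U, M) and the human labeling step, and the data collections are plain Python sets.

def active_learning_loop(labeled, unlabeled, rounds, train_model, acquire, label_samples):
    # labeled: set of (sample, label) pairs; unlabeled: set of samples.
    model = train_model(labeled)
    for _ in range(rounds):
        # T^(i) = a(U^(i), M^(i)): pick the next batch of unlabeled samples.
        batch = acquire(unlabeled, model)
        # U^(i+1) = U^(i) minus T^(i), and L^(i+1) = L^(i) plus the newly labeled batch.
        unlabeled = unlabeled - batch
        labeled = labeled | label_samples(batch)   # oracle provides labels
        model = train_model(labeled)               # retrain M^(i+1) on L^(i+1)
    return model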
Cover Radius As An Upper Bound
Theorem 1 of [100] states that, for any set of data points S and a subset T ⊆ S, if S
can be covered by its core set T with radius \delta_{T \to S} (which means that for each data
point x in S, we can find a data point x' in T such that \|x - x'\| \le \delta_{T \to S}), the
difference of the average training losses on S and T is upper bounded by the radius
\delta_{T \to S} plus 1/\sqrt{|S|}, up to a constant scale factor:
\frac{1}{|S|} \sum_{(x_i, y_i) \in S} l(x_i, y_i; \theta_S)
  - \frac{1}{|T|} \sum_{(x_i, y_i) \in T} l(x_i, y_i; \theta_T)
  \le O(\delta_{T \to S}) + O\!\left(\frac{1}{\sqrt{|S|}}\right),  (3.8)
where l is the loss function of the model M, and \theta_S and \theta_T are the parameters trained
on sets S and T, respectively.
In this case, if we let S be the pool of data samples and T be the subset of
samples chosen to get labeled in each step, the difference of the average training
losses between S and T is bounded by \delta_{T \to S}. Since the training loss can be regarded
as a reasonable estimate of the testing loss, we can then use \delta_{T \to S} to infer how good
a subset T is in terms of the testing loss. Since S is usually very large, we can basically
assume that the right-hand side of Equation 3.8 is dominated by the O(\delta_{T \to S}) term.
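For intuition, the cover radius \delta_{T \to S} that controls the right-hand side of Eq. (3.8) can be computed directly from feature vectors, as in the short numpy sketch below (array names are illustrative).

import numpy as np

def cover_radius(S, T):
    # S: (n, d) features of the full pool; T: (m, d) features of the core set.
    # delta_{T->S}: the largest distance from any point of S to its nearest point of T.
    dists = np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1)   # (n, m)
    return dists.min(axis=1).max()

S = np.random.default_rng(1).normal(size=(200, 16))
T = S[:10]                       # a small core set drawn from S
print(cover_radius(S, T))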
Inspired by the theoretical relationship involving the radius of a cluster shown in Equation
3.8, we propose the k-covers objective function:
\arg\min_{T} \sum_{i=1}^{|T|} \max_{x \in Z_i} \|x - T_i\|,  (3.9)
where T is the set of “centroids” and Z_i is the set of points that are closer to centroid T_i
than to any other centroid. A simple interpretation is that the maximum distance to the
centroid is a reasonable metric to reflect the integrity of a cluster.
The objective of k-covers is closely related to the k-means objective, which aims to
minimize the sum of distances from each point to its closest centroid:
\arg\min_{T} \sum_{i=1}^{|T|} \sum_{x \in Z_i} \|x - T_i\|,  (3.10)
which suggests that we can design an approximation algorithm for k-covers that is
similar to Lloyd's algorithm for k-means.
We use the k-covers objective function to build clusters such that the radius of each
cluster Z_i is small. The smaller the radius, the more similar the data samples in a cluster
are, so choosing too many samples from one cluster may reduce the overall diversity
of the acquired labeled data. Therefore, one sample per cluster is enough to represent
the whole cluster. Similar to [85] and [89], which also combine the cluster information
and the uncertainty information in their data acquisition functions, we choose the sample
with the largest uncertainty in each cluster to get labeled.
Solving K-covers
Similar to Lloyd's algorithm [71] for the k-means objective in Equation 3.10,
our algorithm has two steps in each iteration: an assignment step and an update step. The
assignment step assigns each point to its closest centroid and forms the sets Z_i, which
is the same as in k-means. The update step calculates the centroid that minimizes the
maximum distance within Z_i, whereas k-means minimizes the sum of distances.
Given an initial set of centers C^{(0)}, at iteration (t) we have:
assignment step
Z_i^{(t)} = \{ x_j \mid \arg\min_{c \in C^{(t)} \cup L^{(t)}} \|x_j - c\| = C_i^{(t)} \},  (3.11)
update step
C_i^{(t+1)} = \arg\min_{z} \max_{x \in Z_i^{(t)}} \|x - z\|,  (3.12)
update step - approximation
C_i^{(t+1)} = \arg\min_{z \in U^{(t)}} \max_{x \in Z_i^{(t)}} \|x - z\|.  (3.13)
Equation 3.11 in the assignment step is mostly the same as the one in
standard k-means clustering. The major difference is that samples that are assigned
to the existing labeled samples in L^{(t)} are temporarily excluded from the following
update step.
Equation 3.12 in the update step is essentially a minimum bounding
sphere problem [95]. Rather than the linear-programming-based method in [27] or the
more recent "extremal points optimal sphere" method in [62], we found that the
simple approximation shown in Equation 3.13, which restricts the centroids to
existing data samples, works well in our empirical studies.
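To make the iteration concrete, a compact numpy sketch of Eqs. (3.11)-(3.13) is given below. It restricts the candidate centers to existing samples, as in the approximation of Eq. (3.13); the exclusion of points assigned to already-labeled samples and the subsequent per-cluster selection are omitted for brevity, and all names are illustrative.

import numpy as np

def k_covers(X, k, iters=20, seed=0):
    # X: (n, d) features of the unlabeled pool; k: number of clusters.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step (Eq. 3.11): each point goes to its closest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)  # (n, k)
        assign = dist.argmin(axis=1)
        for i in range(k):
            members = X[assign == i]
            if len(members) == 0:
                continue
            # Approximate update step (Eq. 3.13): among existing samples, pick the
            # one whose farthest cluster member is closest (min-max distance).
            dz = np.linalg.norm(X[:, None, :] - members[None, :, :], axis=-1)   # (n, |Z_i|)
            centroids[i] = X[dz.max(axis=1).argmin()]
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return dist.argmin(axis=1), centroids

The batch to be labeled is then obtained by taking the most uncertain sample of each resulting cluster.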
Overall Acquisition Function
From the uncertainty functions discussed in Section 3.2.1, we can see that all of
them need the softmax output of the neural network as the input for estimating the
posterior distribution. However, the softmax output may not reflect the real
probability when the labeled data for some classes are insufficient, which could make
the acquisition function biased. In this case, we claim that introducing the geometrical
information described in Section 3.2.2 helps to select samples that are more
diverse and thus more class-balanced.
As a result, our final acquisition function is described as follows. In each round,
we first compute the k-covers clusters Z (Z_i contains the data samples x_j in cluster i)
based on the bottom-layer features of x_j ∈ U. Then, we choose the one sample with the
largest uncertainty in each cluster:
a(U, M) = \{ \arg\max_{x \in Z_i} f(x), \; i = 1, \dots, K \},  (3.14)
where K is the number of clusters and f can be any uncertainty function. K is set to
the number of samples to get labeled in this step.
3.2.3 TBAL
We consider batch-mode active learning for the image classification task in this section.
Specifically, we propose a new method, called Two-stage Batch-mode Active Learning
(TBAL). In this section, we first provide an overview of the TBAL solution.
Then, we give a detailed discussion of the first and the second stages.
System Overview
The system diagram of the TBAL method is shown in Fig. 4.3. It has multiple data
acquisition rounds. In each round, the TBAL method will determine a subset of unla-
beled samples for labeling, which is indicated by the blue boxes in the figure. Each
acquisition round has two stages. In the first stage, we apply semi-supervised training
to all labeled and unlabeled samples to produce an individual uncertainty score and a
high-dimensional feature representation for each unlabeled sample. In the second stage,
a novel technique, called Cluster Re-Balancing (CRB), is proposed to tailor our query
strategy to the batch mode setting by taking into account samples’ correlation and diver-
sity in a batch.
We introduce the parameters of the TBAL system below. The data acquisition process
contains T rounds. S_0 and U_0 are the labeled and unlabeled datasets in the
initialization stage. In the i-th round, we choose B samples from the unlabeled set,
U_{i-1}, for labeling. The batch of samples that is labeled in the i-th round is denoted by
S_i. Apparently, |S_i| = B. At the end of the (i-1)-th round, the labeled dataset is
L_{i-1} = \cup_{j=0}^{i-1} S_j,  i = 1, \dots, T,  (3.15)
and the unlabeled dataset is
U_{i-1} = U_{i-2} \setminus S_{i-1},  i = 1, \dots, T.  (3.16)
We use L_{i-1} (and U_{i-1} optionally) to train a deep learning model, M_i, and design an
acquisition function A(M_i, U) to select a batch of data samples, denoted by S_i, for
labeling in the i-th round. Mathematically, we have
S_i = A(M_i, U_{i-1}).  (3.17)
A trained deep learning model, M, provides two functions to be used in active learning.
• Feature Function
The feature function, denoted by f(x; L, U) ∈ R^d, outputs a d-dimensional feature
vector with respect to input image x for a given labeled set L and unlabeled
set U. For a CNN designed for the image classification task, the feature function
is usually composed of the responses of the final feature representation layer, which
typically lies before either the fully-connected layer [74] or the softmax [100] layer of
the network.
• Distribution Function
The distribution function, denoted by p(x; L, U) ∈ [0, 1]^C, is the estimated probability
distribution over C classes for input image x under labeled set L and unlabeled
set U. It is the output (after softmax) of the network. Furthermore, the individual
score of input image x can be defined on p(x; L, U), which is discussed in Sec. 3.2.1.
Clearly, the query strategy should be correlated with the feature and the distribution
functions for the following reasons. First, uncertainty sampling chooses the samples
with the top B uncertainty values, where the uncertainty function can be computed from the
distribution function p(x; L, U). Second, diversity sampling chooses samples based on
the distribution of features. For instance, the greedy approximation of the core-set
active learning algorithm [100] chooses samples that are farthest away from the current
labeled set. Our query strategy, called cluster re-balancing (CRB), considers the
feature and the distribution functions jointly, and is detailed in Sec. 3.2.3.
Stage I: Model Update
Generally speaking, we need to define the “priority” of an unlabeled sample, x, with
respect to a deep learning model. This is conducted in Stage I. We first discuss how
to update the deep learning model in the i-th round. Then, we use the uncertainty
metrics described in Sec. 3.2.1 to measure the priority scores.
We use L_{i-1} and U_{i-1} to update the deep learning model, M_i, in the i-th round. Besides
measuring the informativeness of unlabeled samples, we can use them for uncertainty
estimation and model update. To train M_i, we need to define the loss function. Along
this line, we introduce the consistency regularizer of the Π-model [61]. The loss function
consists of two terms. The total loss is the weighted sum of the supervised loss and
the consistency loss as given below:
\mathrm{Loss}(\theta) = \sum_{x_k \in L_i} l(p(x_k; \theta), y_k)
  + \lambda \sum_{x_j \in L_i \cup U_i} \| p(x_j; \theta) - p(\tilde{x}_j; \theta) \|,  (3.18)
which is minimized over the network parameters \theta. The first term is the supervised loss
and the second term is the consistency loss. The supervised loss is the sum of the categorical
cross-entropy function, l, on labeled data. The consistency loss is the sum of Euclidean distances
between image sample x_j and its perturbed neighbor \tilde{x}_j, weighted by \lambda.
The consistency loss is used as a regularization term to avoid potential overfitting
due to a small number of labeled samples. It can be justified by the fact that our network
should have roughly the same responses to any input data perturbed by noise (e.g., Gaussian
noise), stochastic augmentation or dropout. Imposing this regularizer on unlabeled
data can improve the quality of the uncertainty estimation function as well as the hidden
feature space.
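A minimal numpy sketch of the loss in Eq. (3.18), assuming the supervised term is the categorical cross-entropy and the two predictions of each image come from two perturbed (e.g., augmented or dropout-perturbed) forward passes; it illustrates the computation rather than the exact training code, and all names are illustrative.

import numpy as np

def tbal_stage1_loss(p_labeled, y_onehot, p_all, p_all_perturbed, lam=1.0):
    # p_labeled: (n_l, C) softmax outputs on labeled images; y_onehot: (n_l, C) labels.
    # p_all, p_all_perturbed: (n, C) softmax outputs of the same images (labeled and
    # unlabeled) under two different perturbations, as in the consistency term.
    # lam: weight of the consistency term (illustrative name for the weight in Eq. 3.18).
    supervised = -np.sum(y_onehot * np.log(p_labeled + 1e-12), axis=1).sum()
    consistency = np.linalg.norm(p_all - p_all_perturbed, axis=1).sum()
    return supervised + lam * consistency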
Stage II: CRB Query Strategy
A query strategy that selects a batch of samples with the top priority scores may not be
effective since it does not consider the redundancy among selected samples. For instance,
labeling one image of high uncertainty may reduce the uncertainty of other similar
images. Then, selecting visually or semantically similar images of high uncertainty
in one batch could be a waste. To address this problem, it is desirable to increase the
diversity of selected samples in a batch. This can be achieved by getting samples from
multiple clusters of unlabeled data in a balanced manner. Our query strategy, called
the cluster re-balancing (CRB) query strategy, takes into account the distribution of the
high-dimensional feature space and the diversity of unlabeled data jointly.
Our CRB query strategy consists of three steps. It can be roughly stated as follows.
1. Build initial clusters of candidate samples by running the k-means algorithm on
the images with the top B · R priority scores, where B is the sample budget and R is
the sample rate. The sample rate R controls the number of candidates to process.
2. Allocate budget B to the K clusters based on the difference between the proportions
of unlabeled and labeled samples in each cluster. Intuitively, we should give a higher
budget to a cluster with a larger difference between unlabeled and labeled samples.
3. For a selected cluster, choose the samples that are least similar to those samples
that have already been labeled.
These three steps are elaborated below.
Step 1: Initial Clustering. The objective of the k-means clustering algorithm is to
minimize the sum of distances from each point to its closest centroid:
\arg\min_{C} \sum_{i=1}^{|C|} \sum_{x \in Z_i} \| F(x) - F(C_i) \|,  (3.19)
where C is the set of centroids and Z_i is the set of points that are closer to centroid C_i than
to any other centroid. The running time of the k-means algorithm depends on the cluster
number and the total data number. We would like to construct the initial clusters of candidate
samples in this step. Its pseudo-codes are given in Algorithm 1 for clarity. It has
two key parameters: 1) the cluster number K and 2) the candidate rate R. Specifically,
the candidate rate R controls the size of the candidate set. In all experiments in this
work, we have K = 2C, where C is the class number, and R = 2. It is worthwhile to
emphasize that we perform clustering based on those unlabeled samples of top priority
scores instead of all unlabeled samples in the k-means algorithm. This is done for two
reasons. First, it is desirable to eliminate the majority of less informative instances in the
unlabeled pool. Second, the k-means algorithm converges faster with fewer samples.
Step 2: Budget Allocation. We allocate budget B to the K clusters in this step. Intuitively,
if a cluster has a large number of unlabeled samples but few labeled ones, we need to
label more samples in it. Its pseudo-codes are given in Algorithm 2. In words, we assign
all samples in labeled set L and unlabeled set U to their closest centroids and allocate
the budget to each cluster based on the difference between the proportions of unlabeled
and labeled images, while ensuring each cluster has at least one image labeled.
Step 3: Cluster Re-Balancing. We adopt a KL-divergence-based heuristic to improve
the diversity of selected samples in this step. The pseudo-codes of Step 3 are given
in Algorithm 3, where the KL divergence between the prior label distribution and the
estimated label distribution of sample x is used to measure the potential gain in labeling x.
Algorithm 1: Pseudo-codes for Step 1 of the proposed CRB query strategy.
Input: Feature function F(x), priority H(x) for the input image x, labeled set L,
       unlabeled pool U, batch size B.
Parameter: Number of clusters K and candidate rate R.
Output: K clusters with centroids C_k and corresponding candidate sets Cand_k.
Compute the priority score H(x) for x ∈ U;
Candidate set size \tilde{B} ← R · B;
Construct the candidate set \tilde{U} as the subset of U with the top \tilde{B} highest priority scores H(x);
Run k-means on \tilde{U} with feature function F(x) to get the K centroids C_k (k = 1..K);
Initialize Cand_k ← ∅ for k = 1..K;
for x ∈ \tilde{U} do
    k' ← arg min_k ||F(x) - C_k||;
    Cand_{k'} ← Cand_{k'} ∪ {x};
end
Algorithm 2: Pseudo-codes for Step 2 of the proposed CRB query strategy.
Input: Cluster centroids C_k, candidate sets Cand_k, labeled set L, unlabeled pool U,
       batch size B.
Output: A budget Budget(k) for each cluster k = 1..K.
Assign images in L ∪ U to their closest centroid C_k. (*);
for k ← 1 to K do
    p_U(k) ← proportion of unlabeled images assigned to C_k;
    p_L(k) ← proportion of labeled images assigned to C_k;
    Weight w(k) ← max(p_U(k) - p_L(k), 0);
    Budget(k) ← max(B · w(k), 1);
end
The prior label distribution of cluster k can be approximated by the distribution of the
labeled data in the cluster. By selecting images that are least similar to samples under the
prior distribution, we expect to achieve a higher diversity of labeled data in a cluster. Since
our heuristic maximizes the KL divergence between the estimated label distribution
and the prior label distribution, it chooses the samples that are likely to be different from the
currently "popular" labels in the cluster. For example, if a cluster has 50% cat
images and 50% dog images, it is desirable to add an image that is neither a cat nor a dog
to increase the diversity of labeled samples in the cluster.
Algorithm 3: Pseudo-codes for Step 3 of the proposed CRB query strategy.
Input: Cluster centroids C_k, candidate sets Cand_k, labeled set L, unlabeled pool U,
       batch size B, per-cluster budgets Budget(k).
Output: A set of images S ⊆ U with |S| = B representing the selected batch.
S ← ∅;
for k ← 1 to K do
    Compute the prior ground-truth label distribution \tilde{p}_k(y = c) for the labeled images
        assigned to C_k, re-using the assignment results of L from the (*) step in Algorithm 2;
    Compute the KL divergence KLD_k(x) between \tilde{p}_k(y = c) and p(y = c | x) for x ∈ Cand_k;
    Sort the candidates Cand_k by KLD_k in descending order;
    S_k ← first Budget(k) images in Cand_k;
    S ← S ∪ S_k;
end
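For concreteness, a condensed numpy sketch of Steps 2 and 3 is given below, assuming that the cluster assignments, candidate sets and predicted class distributions have already been computed in Step 1; all variable names are illustrative and the chosen direction of the KL divergence is one possible reading of Algorithm 3.

import numpy as np

def crb_select(cand, p_cand, labels_per_cluster, n_unlab, n_lab, B, num_classes):
    # cand[k]: list of candidate sample ids in cluster k.
    # p_cand[k]: (len(cand[k]), C) predicted class distributions of those candidates.
    # labels_per_cluster[k]: ground-truth labels of the labeled images assigned to cluster k.
    # n_unlab[k], n_lab[k]: counts of unlabeled / labeled images assigned to cluster k.
    K = len(cand)
    n_unlab = np.asarray(n_unlab, dtype=float)
    n_lab = np.asarray(n_lab, dtype=float)
    # Step 2: budget from the proportion difference, at least one sample per cluster.
    w = np.maximum(n_unlab / n_unlab.sum() - n_lab / max(n_lab.sum(), 1.0), 0.0)
    budget = np.maximum((B * w).astype(int), 1)
    selected = []
    for k in range(K):
        # Step 3: prior label distribution of the cluster (with add-one smoothing).
        labels_k = np.asarray(labels_per_cluster[k], dtype=int)
        prior = np.bincount(labels_k, minlength=num_classes) + 1.0
        prior /= prior.sum()
        # KL divergence between each candidate's predicted distribution and the prior.
        kld = np.sum(p_cand[k] * np.log((p_cand[k] + 1e-12) / prior), axis=1)
        order = np.argsort(-kld)                     # most "surprising" candidates first
        selected.extend(np.asarray(cand[k])[order[:budget[k]]])
    return selected

Note that the per-cluster budgets in this sketch need not sum exactly to B; the pseudo-codes above leave the same rounding detail implicit.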
Since image samples included in the candidate set are basically “informative”
enough, a small difference in their estimated priority scores can be neglected. Instead,
we pay more attention to two other properties in our CRB query strategy: 1) diversity in
the feature space and 2) diversity in the label distribution. They are addressed in Step 2
and Step 3, respectively.
3.3 Experiments
3.3.1 Analysis of Batch Redundancy Problem
To demonstrate the batch redundancy problem on a real dataset, we use t-SNE [76] to
visualize the distribution of images in the MNIST dataset. We train an S-CNN on 50
random images and apply t-SNE to its 128-dimensional features.
As shown in Figure 3.2, the redundancy problem generally exists for all tested
uncertainty metrics.
Figure 3.2: Visualization of the selected batch sampled by different uncertainty metrics.
Both tested uncertainty metrics have the redundancy problem.
3.3.2 Experimental Setup of K-covers
We performed experiments on two datasets: MNIST [64] for the handwritten digit recognition
task and CIFAR-10 [60] for the image classification task. Besides using a class-balanced
initialization for active learning as many recent works [6, 100, 29] do, we also compared
the performance with class-unbalanced initialization.
The experiments were conducted to compare the performance of our proposed algorithm
with the following methods: i) random baseline: uniformly choose the samples in the
pool; ii) uncertainty based methods: choose the samples with the largest uncertainty
function, where three uncertainty functions are tested: entropy [50], variation ratio [29] and
MC-dropout variation ratio [29]; iii) core-set greedy [100]: choose the samples with the
largest distance to the already chosen samples; iv) k-median [100]: choose the cluster centroids
after performing the k-median algorithm on the pool of unlabeled samples.
Dataset Model Initial Size Step Size # Steps
MNIST S-CNN 50 100 9
CIFAR-10 ResNet20 500 2000 5
Table 3.1: Settings used for the evaluation of pool based active learning.
The basic settings, including the models and hyperparameters used in the evaluation, are
listed in Table 3.1. For S-CNN, we use a shallow neural network with two convolutional
layers and one dense layer. This S-CNN is the same as the one in [6]. For ResNet20,
we build the network using three stages of residual blocks with the bottleneck design [39],
which gives a total of eighteen convolutional layers and two dense layers.
For all models, we train the network using ADAM optimization [54] with He's initialization
[37] for 150 epochs. The learning rate is adjusted using a decaying scheme
after the validation loss plateaus [38]. Also, we use the same data augmentation as in
[38]. All the CNNs are implemented in Keras.
We evaluated our k-covers based method with multiple
uncertainty functions: entropy, variation ratio and Monte-Carlo dropout variation ratio,
as discussed in Section 3.2.1. The number of clusters is set to the batch size, which
is 128 in our case.
3.3.3 Experimental Results of K-covers
Visualization of Selected Batch
The visualization of the batch selected by k-covers is shown in Figure
3.3. The left sub-figure shows the selected batch of k-covers
and the right sub-figure shows the batch selected by uncertainty sampling. It
is clear that k-covers sampled a more balanced batch of data that is roughly diversely
Figure 3.3: Visualization of the selected batch sampled by k-covers and uncertainty
sampling. K-covers can select a more balanced batch of unlabeled data, which are also
of high uncertainty (near the decision boundary of two or more labels).
distributed in the feature space and is also near the decision boundary of two or more
labels, which means they are samples of high uncertainty.
Characterizing The Unbalancedness of Initial Set
Entropy can be used to describe the unbalancedness of a labeled set. We use four different
distributions of the initial set with four levels of entropy, i.e., unbalancedness, as shown
in Table 3.2. The distribution is randomly shuffled before we start to sample the
initial set.
Entropy Label Distribution
3.32 [.10 .10 .10 .10 .10 .10 .10 .10 .10 .10]
2.30 [.01 .01 .01 .02 .07 .08 .08 .09 .10 .53]
1.30 [.01 .01 .01 .01 .01 .01 .01 .01 .17 .75]
0.80 [.01 .01 .01 .01 .01 .01 .01 .01 .01 .91]
Table 3.2: Entropy and its corresponding label distribution that is used in the class-
unbalanced initialization experiments.
Active Learning Curve
(a) MNIST.
(b) CIFAR10.
Figure 3.4: Active learning curves. We can observe a significant drop in performance
for all uncertainty methods (solid red/green/blue lines) when the initial distribution is
unbalanced, while our method (dotted lines) still performs well. Our method
also outperforms the two geometry-based methods (solid purple/orange lines) by a large
margin.
The evaluation results on MNIST and CIFAR-10 can be found in Figure 3.4. The
leftmost subfigures show the results of balanced initialization, while the rightmost
subfigures show the results of an extremely unbalanced initialization case. The solid
lines indicate the uncertainty based methods and the dotted lines indicate our method.
From Figure 3.4, we can make the following observations:
1. The performance of uncertainty methods drops significantly when the initialization
is unbalanced.
2. Our method is robust to the unbalanced initialization.
3. When the initialization is balanced, our method can still slightly outperform the
corresponding uncertainty method, e.g., kcovers+entropy vs. entropy, as well as
other state-of-the-art geometry based methods.
(a) varr. (b) kcovers+varr.
Figure 3.5: Distributions produced by different data acquisition functions during the
active learning process. The x-axis indicates the steps. The y-axis indicates the label
distribution, where different colors denote different labels (0 to 9 from bottom to top).
The labels are based on the ground truth. These distributions are drawn from a random
trial of active learning on MNIST when the initial distribution is extremely unbalanced
(entropy = 0.8).
Label Distribution
We ran an experiment to compare the distribution of labels during the active learning
process. Intuitively, the uncertainty method may not work well in the first few
rounds, where the uncertainty metrics are not yet accurate. The results are shown in
Figure 3.5.
From Figure 3.5, we can see that the distribution chosen by the variation ratio method
is heavily biased towards the orange class, while our method generates an overall more
balanced distribution.
3.3.4 Experimental Setup of TBAL
Datasets
To compare the performance of our TBAL method with other active learning methods,
we conduct an extensive experimental evaluation on three public large-scale image clas-
sification datasets. They are:
• SVHN (Street View House Numbers) [84]
It consists of 73,257 training images and 26,032 testing images of real-world
house numbers. They are color images in RGB format with a resolution of 32×32.
• CIFAR-10 [60]
It is a 10-class image classification dataset that contains 50,000 training images
and 10,000 testing images. They are color images in RGB format with a resolution
of 32×32.
• CUB-200-2011 [115]
It is a 200-class bird image dataset. It contains 11,788 color images in RGB format
with a resolution of around 500×350.
Benchmarking Methods
We compare our TBAL method with the following two baseline methods and three state-
of-the-art semi-supervised active learning methods:
• Uncertainty-only baseline
It simply chooses a batch of samples with the largestB uncertainty scores. Four
uncertainty functions are considered: variation ratios (Varr), max margin, entropy
and Monte-Carlo variation ratios (MCVarr).
• Diversity-only baseline
It runs the greedy based core-set selection [100] on deep features derived from the
convolutional neural networks.
• VAAL [105]
• CEAL [117]
• LL4AL [126]
Figure 3.6: The protocol of evaluating query strategies in supervised and semi-
supervised settings.
Evaluation Protocol
We propose a new evaluation protocol to measure the informativeness of the samples
selected by different active learning strategies. With this protocol, we would like to see
how a sample selected by an active learning strategy contributes to the overall model
accuracy when no extra unlabeled data is available. This protocol is especially valuable
for an active learning strategy in the semi-supervised setting because it can isolate the
effect of unlabeled data. It is described below.
We first randomly select an initial sample set from the training set and re-train all
models from scratch after each data acquisition round. The initial sample set is class-
balanced and pre-generated for each dataset. It is shared by all active learning methods.
Afterwards, different active learning methods adopt different query strategies. We evaluate
the semi-supervised accuracy of a query strategy as well as its transferred full-supervised
accuracy, as shown in Fig. 3.6.
• Semi-supervised accuracy
We run a query strategy using the semi-supervised setting (see Fig. 4.3) to get
batch S_i at the i-th round, and measure its semi-supervised accuracy with the selected
batches S_0, ..., S_i.
• Transfer to full-supervised accuracy
We transfer the semi-supervised setting to the full-supervised setting. That is, we
fix the selected batches and re-train a model with the same architecture using labeled
data only. In this case, we set λ = 0 in the loss function of Eq. (3.18).
For a given query strategy, we run the data acquisition process on the three initial sets.
With three repeats of the whole acquisition process, we calculate the mean of a total of
nine runs to reflect the performance of each query strategy.
Implementation Details
We use ResNet-18, DenseNet and ResNet-152 for SVHN, CIFAR-10 and CUB-200-
2011, respectively. The baseline models and their hyperparameters in the evaluation are
given in Table 3.3. For the ResNet, we train the network using the SGD optimizer with
the Nesterov momentum. Its architecture is constructed with the built-in implementation
of ResNet-18 and ResNet-152. For the DenseNet, we build it using three stages of dense
blocks, the bottleneck design [43], and a dropout rate of 20%. It is trained using SGD.
Dataset Model # Initial # Step Size # Steps
SVHN ResNet-18 100 500 5
CIFAR-10 DenseNet (k=12, d=40) 500 1000 5
CUB-200-2011 ResNet-152 500 1000 5
Table 3.3: Summary of experimental settings, including datasets, CNN models and
hyper-parameters.
The initial learning rate is 0.1; it is then reduced to 0.01 at 50% of the training schedule
and to 0.001 at 75%. This setting is the same as that in [43] for CIFAR-10 and SVHN.
Both architectures are implemented in Keras [14].
3.3.5 Experimental Results of TBAL
We evaluate the proposed TBAL method and compare it with various baseline active
learning strategies in this section. Experiments on uncertainty/diversity balancing and
on different uncertainty metrics and semi-supervised training methods are presented
first. Then, the proposed TBAL method is compared with other state-of-the-art active
learning methods.
Uncertainty/Diversity Balancing
A query strategy that considers uncertainty only tends to suffer from duplication when
the sampling batch size is large. Many samples of the same label might be selected
since they are all close to the decision boundary and, thus, of higher uncertainty. This is
apparently not ideal. Duplication in the early data sampling stage can harm the overall
performance significantly.
We compare the mean accuracy performance curves of the TBAL method (“U +
D with CRB”), a baseline uncertainty/diversity-combined method that simply chooses
the top-1 sample after Algorithm 2 as described in Sec. 3.3 (“U + D with top-1”),
(a) SVHN (b) CIFAR-10 (c) CUB-200-2011
Figure 3.7: Comparison of accuracy performance curves for three datasets: (a) SVHN,
(b) CIFAR-10, and (c) CUB-200-2011. The two rows of figures represent semi-
supervised accuracy and transferred supervised accuracy, respectively. The performance
curves of a random baseline, the diversity-only baseline, the uncertainty-only baseline,
a simple combination method and our CRB-based method are shown in blue, orange,
green, purple and red colors, respectively.
uncertainty-only and diversity-only methods that are described in Sec. 3.3.4, and a
random selection baseline for SVHN, CIFAR-10 and CUB-200-2011, respectively. The
accuracy curves are shown in Fig. 3.7. We have the following four observations from
this figure.
1. The TBAL method (red curve) works the best. It outperforms all other baseline
methods by a clear margin.
2. Combination based methods (purple and red curves) work generally better, and their
transferred supervised accuracy is significantly improved over the other baselines.
3. The uncertainty-only based method (green curve) is not robust to the transferred
supervised task. It even gets severely worse results than the random baseline on the
Figure 3.8: Comparison of the mean performance gain of different versions of the pro-
posed TBAL method on SVHN, CIFAR-10 and CUB in terms of semi-supervised accu-
racy (left) and transferred supervised accuracy (right). The TBAL method with MCVarr
and CRB (red bars) achieves the best performance in both settings.
SVHN task, which is mostly caused by its bad performance in the first few
rounds, where uncertainty functions usually do not work well.
4. The diversity-based method (orange curve) is only slightly better than the random
baseline. We assume that this phenomenon is mainly caused by its lack of
efficiency in later rounds of the data sampling process, where the informativeness of
selected samples becomes more important to the performance gain.
The averaged difference is compared in Fig. 3.8. For all three test datasets, “U+D”
based methods get the best performance in both semi-supervised accuracy and trans-
ferred supervised accuracy metrics. The overall best performance is achieved by the
CRB-based “U+D” method. It has a gain of 0.47% and 2.40% for SVHN, 1.41% and
2.34% for CIFAR-10, and 6.29% and 5.78% for CUB, in semi-supervised accuracy and
transferred supervised accuracy, respectively. In contrast, the uncertainty-based method
does not work well in transferred supervised accuracy on any of the three datasets; it even
performs worse than the random baseline. We have three observations from this figure.
1. Uncertainty sampling harms the transferred supervised accuracy significantly.
2. The combination of uncertainty and diversity improves the performance.
3. The CRB strategy has the best performance in both semi-supervised accuracy and
transferred supervised accuracy.
Robustness of TBAL
(a) SVHN (b) CIFAR-10 (c) CUB-200-2011
Figure 3.9: Performance comparison between semi-supervised training methods with
different uncertainty functions, where the three rows stand for three uncertainty func-
tions, three columns indicate three datasets, and solid and dotted curves represent
CEAL-based and TBAL-based semi-supervised training methods, respectively.
Figure 3.10: Comparison of averaged accuracy based on different uncertainty functions,
where individual performance curves are shown in Fig. 3.9.
To demonstrate the robustness of the proposed TBAL method under different uncer-
tainty metrics, we conduct experiments using three uncertainty functions: (1) varia-
tion ratios, (2) entropy, and (3) margin. We also compare it with two different semi-
supervised learning methods: TBAL-S1 (the first stage of the proposed TBAL) and
CEAL-S1 (a semi-supervised method based on thresholding). The accuracy curves are
shown in Fig. 3.9, where the three rows stand for three uncertainty functions, three
columns indicate three datasets, and solid and dotted curves represent CEAL-based
and TBAL-based semi-supervised training methods, respectively. We see that TBAL
achieves the best performance in all cases. The averaged accuracy based on different
uncertainty functions is illustrated in Fig. 3.10. The best performance is obtained by the
MCVarr uncertainty function combined with TBAL-S1, +1.88%, as compared with other
uncertainty functions: +1.38%, +1.18% and +1.05% for TBAL-based semi-supervised
training, and +0.97%, +0.62% and 0.35% for CEAL-based semi-supervised training
methods. TBAL-S1 achieves consistently better performance than CEAL-S1 using all
three uncertainty functions.
Performance Benchmarking with State-of-the-Art Methods
(a) SVHN (b) CIFAR-10 (c) CUB-200-2011
Figure 3.11: Performance comparison of TBAL with two state-of-the-art methods, VAAL
and LL4AL, and the random selection baseline on three datasets: SVHN, CIFAR-10 and
CUB-200-2011.
We compare the proposed TBAL method with two state-of-the-art semi-supervised
active learning methods: VAAL [105] and LL4AL [126]. The performance curves for
SVHN, CIFAR-10 and CUB-200-2011 are given in Fig. 3.11. The averaged performance
difference with respect to the random baseline is illustrated in Fig. 3.12. For semi-supervised
accuracy, all three benchmarking methods have comparable performance on SVHN and
CIFAR-10, and TBAL performs slightly better than the other two on CUB. Specifically,
TBAL, LL4AL and VAAL have performance gains of 6.29%, 4.68% and 4.25% over the
random baseline. For transferred supervised accuracy, VAAL does not perform well on
SVHN and CIFAR-10. It is even worse than the random baseline by 0.86% and 0.03%,
respectively, as shown in Fig. 3.12. TBAL performs consistently better than VAAL and
LL4AL on all three datasets.
Figure 3.12: Comparison of the averaged performance gain for VAAL, LL4AL and
TBAL, where semi-supervised accuracy and transferred supervised accuracy are shown
in the left and the right subfigures, respectively.
3.4 Conclusion
We conducted experiments to identify and analyze the effect of the batch redundancy
problem in active learning. Based on the observation and the assumption that combining
uncertainty and diversity could help, we proposed two methods, k-covers and TBAL,
to address the batch redundancy problem. We showed consistent performance
improvements for both proposed methods.
Specifically, in the experiments on k-covers, empirical results show that uncertainty
methods do not work well when the initial set is unbalanced. With the help of our
proposed k-covers method, which partitions the feature space into several clusters, we
outperform baseline uncertainty based methods as well as two diversity based methods
by a clear margin regardless of the balancedness of the initial set.
In the direction of balancing uncertainty and diversity, we further proposed
TBAL to address the two main drawbacks of k-covers: the information in a large number
of unlabeled samples being ignored in the model training stage, and the number of clusters
depending on the batch size. In the first stage of TBAL, we present a new semi-supervised
learning framework that offers accurate uncertainty estimation and finds a
high-dimensional feature representation for unlabeled samples. It leverages a small
number of labeled data and a large number of unlabeled data. In the second stage of
TBAL, a novel technique called cluster re-balancing (CRB) is used. The CRB technique
exploits the information associated with the latent feature space to address redundancy
in a selected batch. It avoids selecting multiple sample images that are too similar in
the latent feature space. It can be regarded as a post-processing technique, which is
agnostic to any uncertainty measure and/or the feature space. The CRB module can
be integrated with any two-stage framework that has individual uncertainty scores in a
latent (or explicit) feature space. This condition is clearly met by CNN-based classifiers.
Chapter 4
Weakly Supervised Human Pose
Estimation
4.1 Introduction
Human pose estimation is a long-standing problem in computer vision research. It has
numerous applications such as sports, augmented reality, motion analysis, visual avatar
creation, etc. In the past ten years, a major advance has been made in building large-
scale human pose datasets and developing deep-learning-based models for human pose
estimation. As a result, estimating multi-person 2D poses and/or a single-person 3D
pose in complicated scenes become mature. Yet, multi-person 3D pose estimation is
still a challenging problem due to the lack of large-scale high-quality datasets for this
application. This is further hindered by inadequate tools to obtain reliable 3D informa-
tion for monocular images.
It is nontrivial to acquire high quality 3D supervisions in general in-the-wild scenes.
Depth sensors (e.g., Kinect) can provide useful data, yet the acquisition is typically
limited to indoor environments. Furthermore, it often demands a large amount of manual
work in capturing and processing. As an alternative, Li et al. [67] attempted to estimate
dense depth maps using the multi-view stereo (MVS) method from video clips captured
in stationary scenes.
The main objective of our current research is to obtain high quality 3D skeletons
from multiple 2D poses obtained in each frame of action-frozen people video. Such a
Figure 4.1: Exemplary video clips from the Mannequin dataset, where action-frozen
people keep certain poses in diverse environments such as cafe, school, living room,
hall, etc.
dataset was collected by focusing on a special kind of Youtube video. That is, people
imitate mannequins and freeze in elaborate and natural poses in a scene while a hand-
held camera tours the scene to create the desired video. It is called the Mannequin
dataset. This type of video is suitable for human pose estimation since it can provide
diverse poses from people of different ages and genders with different scales in a wide
variety of outdoor scenes. We propose a multi-view matching (MVM) method to achieve
this goal. With a large amount of in-the-wild video data labeled by 3D supervisions
automatically generated by MVM, we are able to train a neural network that takes a
single image as the input and generate the associated multi-person 3D pose estimation
as the output.
Our approach has several advantages in multi-person 3D pose estimation. First, the
scene does not have to be an indoor and/or lab environment. Second, there are a con-
siderable number of video clips available online. If needed, it is possible to make more
action-frozen people video at a low cost. Third, we can get high quality 3D skeleton
information by exploiting the static scene assumption. To the best of our knowledge,
there is only one in-the-wild 3D human pose dataset, called MPII-3DPW [113]. It relies
on Inertial Measurement Unit (IMU) sensors attached to a few actors or actresses. As
compared to MPII-3DPW, our approach is more scalable. It contains more diverse con-
tents in terms of subjects and environments. Some exemplary video clips in the Man-
nequin dataset are shown in Fig. 4.1. With the help of such video clips, we can obtain
diverse 3D human poses in a wide range of scenes in daily life.
Estimating 3D skeletons from predicted 2D poses of multiple views has been stud-
ied previously in several settings, e.g. the synchronized multi-camera lab environment
[22, 48], the synthesized sports video [8]. However, existing methods do not work well
in the current setting (namely, video of action-frozen people captured by a hand-held
camera), because they either do not exploit the strong assumption of static
scenes [48, 8] or their algorithms are computationally intractable [22]. In the action-
frozen video clip, each frame can be essentially regarded as a single view of the scene.
A typical 10-second video sequence sampled at 25fps will yield 250 views of the scene
from 250 viewing angles. As observed in [22], the main challenge is to build the corre-
spondence of predicted 2D poses in different frames. The optimization method in [22]
was developed to minimize the cycle consistency loss in a multi-camera lab environ-
ment with a small number of views (e.g. five cameras). The solution in [22] becomes
computationally intractable in our current case which has a larger number of views and
more people in the scene.
Figure 4.2: Visualization of estimated camera poses: exemplary image frames in two
video clips (left) and the corresponding estimated camera pose trajectory curves in red
(right).
Besides, pose tracking methods like [1] could be a potential option when we need to
determine which 2D poses in different frames correspond to the same person, which is
the core step in our proposed method. However, we argue that they are not efficient
enough in our case. When identifying the correspondence of multiple 2D poses, the
3D geometric constraints play a more important role than the visual similarity cues
that pose tracking methods usually rely on. Since it is not common for pose tracking
methods to rely on the assumption of static scenes, the 3D geometric constraints are
not fully utilized by them.
The pipeline of the proposed MVM method is shown in Fig. 4.3. First, we use
the OpenPifPaf method [59] to estimate multi-person 2D poses at each frame. Sec-
ond, we find the correspondence of 2D poses across multiple frames. This is achieved
Figure 4.3: Illustration of the generation of multi-person 3D poses by the proposed
MVM method: the input sequence of image frames (first block), 2D pose estimation at
each frame (second block), building the correspondence of 2D poses across multiple
frames (third block), and recovering 3D poses from matched 2D poses (fourth block).
by adopting an approximation to the optimization objective in [22], where both geo-
metric constraints and appearance similarities are taken into account. Third, we apply
triangulation to groups of matched 2D poses and conduct bundle adjustment to recover
multi-person 3D poses. We provide both qualitative and quantitative results to show the
quality of the multi-person 3D poses in Youtube video.
In practice, we are interested in multi-person 3D pose estimation from a single
image. To accomplish this goal, we train a monocular multi-person 3D pose network,
which is a modification of the network proposed in [59], from multiple frames using
the 3D supervisions obtained by MVM. To demonstrate the effectiveness of 3D super-
visions provided by the MVM method, we conduct experiments on the 3DPW and the
MSCOCO datasets. Evaluation shows performance improvement in the 3D human pose
estimation accuracy for the 3DPW dataset and the 2D human pose estimation accuracy
for the MSCOCO dataset.
4.2 Methodology
4.2.1 Multi-View Matching Method
The overall pipeline of the proposed 3D multi-view matching (MVM) method is illus-
trated in Fig. 4.3. For each input clip, we first estimate the camera poses for all frames.
Then, a 2D pose estimator is independently applied to those frames to get the initial 2D
pose predictions. Next, we run the matching algorithm to find the correspondence of
2D poses across frames. After that, we use triangulation on 2D poses that belong to the
same person to get the initial 3D skeleton. Finally, the 3D skeleton is further finetuned
by the bundle adjustment algorithm for better consistency of the estimated 3D pose.
Camera Pose Estimation
By following an approach similar to that adopted by [67] and [131], we use ORB-
SLAM2 [83] to identify trackable sequences in each video clip and estimate the initial
camera pose for each frame. In this stage, we set the camera intrinsics the same as
the one provided in the original Mannequin dataset, and process a lower-resolution ver-
sion of the video clip for efficiency. Afterwards, we process each sequence at a higher
resolution using a visual SfM system [97] to refine initial camera poses and intrinsic
parameters. This method extracts and matches features across frames. Then, it con-
ducts the global bundle adjustment optimization. Our implementation is based on an
open-sourced multi-view stereo (MVS) system called COLMAP [98]. Two camera pose
estimation examples are shown in Fig. 4.2. We show an exemplary image frame on the
left and the estimated camera pose trajectory curve for the corresponding video clip
in red on the right.
2D Pose Estimation
We use the OpenPifPaf method in [59] as the 2D human pose estimation baseline net-
work. It is trained using results from the keypoint detection task of the MSCOCO
dataset. For fair comparison with other similar work in [58] and [56], we adopt a
ResNet-152 [39] backbone feature extractor and run the 2D pose estimator on each
input frame independently to get the 2D pose predictions x
ij
along with confidence
scorew
ij
2 [0; 1].
Matching
The objective of this matching step is to determine a set of 2D poses that belong to the
same person. This can be mathematically stated as follows. Suppose that there are T
frames in a video clip. We use x_{ij} ∈ [0, 1]^{C×2} to indicate the j-th 2D pose in the i-th frame,
with C joints for each 2D pose in normalized image coordinates. For each person k,
we would like to determine a group of 2D poses, denoted by
G_k = \{ x_{i_1 j_1}, x_{i_2 j_2}, \dots, x_{i_{M_k} j_{M_k}} \},
that are associated with the same person, where M_k is the number of frames in which this
person shows up.
Affinity Matrix
The matching criterion is based on the affinity function, denoted by A(x_u, x_v), of
two 2D poses x_u and x_v. The affinity function takes two factors into account, namely
the appearance similarity S and the geometric distance D. It is expressed as
A(x_u, x_v) = S(x_u, x_v) \cdot \frac{1}{1 + \exp(D(x_u, x_v))},  (4.1)
where 1 / (1 + \exp(D)) converts a distance measure into a similarity measure and
D(x_u, x_v) = \frac{1}{2C} \sum_{c=1}^{C} \left[ d(x_u^c, L_{uv}(x_v^c)) + d(x_v^c, L_{vu}(x_u^c)) \right]  (4.2)
is a geometric distance measure between two poses. Furthermore, in Eq. (4.2), L_{uv}(x_v^c)
indicates the epipolar line of the c-th joint of the 2D pose from view u to view v, and d(·)
is the Euclidean distance between a point and a line.
The appearance similarity, S(x_u, x_v), in Eq. (4.1) is calculated using the cosine
similarity of the features extracted from the last conv layer of the network described in
Sec. 4.2.1. The geometric distance, D(x_u, x_v), computes the average distance between
the epipolar lines of the 2D keypoints in one frame and the corresponding 2D keypoints
in the other frame. The overall affinity is the product of the appearance affinity and the
geometric affinity, as shown in Eq. (4.1).
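A small numpy sketch of Eqs. (4.1) and (4.2): it assumes the fundamental matrix F_uv relating the two views is available from the estimated camera poses and maps homogeneous points of view u to epipolar lines in view v (this convention, like the variable names, is an assumption of the sketch), while feat_u and feat_v are the appearance features of the two detections.

import numpy as np

def point_line_distance(pts, lines):
    # Distance of 2D points (n, 2) to lines (n, 3) given as a*x + b*y + c = 0.
    num = np.abs(lines[:, 0] * pts[:, 0] + lines[:, 1] * pts[:, 1] + lines[:, 2])
    return num / np.linalg.norm(lines[:, :2], axis=1)

def geometric_distance(pose_u, pose_v, F_uv):
    # Symmetric epipolar distance of Eq. (4.2) between two (C, 2) joint arrays.
    C = pose_u.shape[0]
    h_u = np.hstack([pose_u, np.ones((C, 1))])     # homogeneous joint coordinates
    h_v = np.hstack([pose_v, np.ones((C, 1))])
    lines_in_v = (F_uv @ h_u.T).T                  # epipolar lines of view-u joints in view v
    lines_in_u = (F_uv.T @ h_v.T).T                # epipolar lines of view-v joints in view u
    d = point_line_distance(pose_v, lines_in_v) + point_line_distance(pose_u, lines_in_u)
    return d.sum() / (2 * C)

def affinity(pose_u, pose_v, feat_u, feat_v, F_uv):
    # Eq. (4.1): appearance cosine similarity times the geometric similarity term.
    s = feat_u @ feat_v / (np.linalg.norm(feat_u) * np.linalg.norm(feat_v) + 1e-12)
    return s / (1.0 + np.exp(geometric_distance(pose_u, pose_v, F_uv)))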
Mutual Consistency Maximization
Suppose that there are N people in total in the entire set of input frames. We use
y_{ij} ∈ [1..N] to indicate the associated person index of pose x_{ij}. Our goal is to maximize
the following objective function:
\underset{y}{\text{maximize}} \;\; \sum_{k=1}^{N} \sum_{(i_1, j_1) \in G_k} \sum_{(i_2, j_2) \in G_k}
   w_{i_1 j_1} w_{i_2 j_2} A(x_{i_1 j_1}, x_{i_2 j_2}),
\text{subject to } \{G_k\} \text{ is a partition of } \{x_{ij}\}.  (4.3)
In principle, Eq. (4.3) can be solved by finding a partition of the affinity matrix
with the spectral clustering method. However, this classical method is too slow to be
practical in our current context. To speed up the optimization process, we develop a
greedy algorithm that finds an approximation to the original objective. The main idea is
described below in words. The algorithm maintains a set of corresponding 2D poses. It
begins with the 2D pose that has the largest confidence in the pool of 2D poses, where
the confidence score is generated by the 2D pose network. At each step, we add the
2D pose with the highest affinity score to the set, and repeat the process until we
cannot find any x_{ij} that has an affinity score above the threshold. The pseudo codes of the
proposed greedy 2D pose matching algorithm are given in Algorithm 4.
Algorithm 4: Pseudo codes for greedy 2D pose matching.
Input: 2D poses x_{ij}, affinity matrix A
Output: G_k for k ∈ [1..N]
Initialize the visited set V = ∅;
for k ← 1 to N do
    Find x_{i_0 j_0} ∉ V with the largest confidence;
    G_k ← { x_{i_0 j_0} };
    Find x_{ij} such that Σ_{x_{pq} ∈ G_k} w_{ij} w_{pq} A(x_{ij}, x_{pq}) is the largest and i ≠ p
        for any x_{pq} ∈ G_k;
    while the confidence of x_{ij} is above the threshold do
        G_k ← G_k ∪ { x_{ij} };
        Find x_{ij} ∉ V such that Σ_{x_{pq} ∈ G_k} w_{ij} w_{pq} A(x_{ij}, x_{pq}) is the largest
            and i ≠ p for any x_{pq} ∈ G_k;
    end
    V ← V ∪ G_k;
end
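A direct Python transcription of Algorithm 4 is given below under a few stated assumptions: pairwise affinities and confidences are pre-computed, tau stands for the unnamed threshold in the pseudo-code, and, following the textual description above, the stopping test is applied to the confidence-weighted affinity score of the best remaining pose.

import numpy as np

def greedy_pose_matching(conf, frame_id, A, tau, num_people):
    # conf: (P,) confidences of the detected 2D poses; frame_id: (P,) frame indices.
    # A: (P, P) pairwise affinity matrix of Eq. (4.1); num_people: N in Eq. (4.3).
    P = len(conf)
    visited = np.zeros(P, dtype=bool)
    groups = []
    for _ in range(num_people):
        free = np.where(~visited)[0]
        if len(free) == 0:
            break
        seed = free[np.argmax(conf[free])]           # most confident unvisited pose
        group = [int(seed)]
        used_frames = {int(frame_id[seed])}
        while True:
            best, best_score = None, -np.inf
            for j in free:
                if int(j) in group or int(frame_id[j]) in used_frames:
                    continue                         # at most one pose per frame
                # Confidence-weighted affinity of pose j to the current group.
                score = sum(conf[j] * conf[p] * A[j, p] for p in group)
                if score > best_score:
                    best, best_score = int(j), score
            if best is None or best_score < tau:
                break
            group.append(best)
            used_frames.add(int(frame_id[best]))
        visited[group] = True
        groups.append(group)
    return groups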
Triangulation and Bundle Adjustment
The reconstruction of 3D poses from a group of corresponding 2D poses is a well-studied
question in the 3D geometry literature. Here, we use the direct linear transform
(DLT) algorithm [35] to estimate the 3D keypoints from multiple corresponding 2D
keypoints. Moreover, we use the RANSAC algorithm to eliminate outliers. The triangulation
is applied to different joints independently. Triangulated 3D poses may suffer
from small parallax angles, which results in a very large reconstruction error. Some
reconstructed poses are even not possible for human beings.
The common practice is to apply bundle adjustment to triangulated 3D poses to
get high quality estimates. To implement this idea, we minimize an error function that
considers the reprojection error and the human pose prior introduced in [4] jointly. The prior
is formulated as a Gaussian mixture (\mu_l, \Sigma_l), l = 1, \dots, 8, fitted on a dataset of diverse
3D poses [48]. The optimization takes the form
\text{minimize}_{X} \; \sum_{x_u} E_R(X, x_u) + E_P(X),
\text{subject to } X \in R^{3C},  (4.4)
where
E_P(X) = -\log \left\{ \sum_{l=1}^{8} \mathcal{N}(X \mid \mu_l, \Sigma_l) \right\}.  (4.5)
After we obtain reconstructed 3D poses, we will project them back to each frame
to generate a 3D supervision that will be used in the training of a 3D pose estimation
network from a single image as discussed in the next section.
4.2.2 Multi-Person 3D Pose Estimation from Single Image
System Overview
In this section, we study how to train a convolutional neural network (CNN) to solve
the problem of multi-person 3D pose estimation from a single image. To address this
problem, we represent estimated 3D poses with two complementary components: 1) 2D
poses and 2) keypoint depths. Our CNN architecture is shown in Fig. 4.4. It is a variant
of a state-of-the-art bottom-up human pose estimation network called PifPaf [59]. It has
three key modules:
• a backbone feature extractor (ResNet-152);
• three prediction modules; namely, Parts Intensity Fields (PIF), Parts Association
Fields (PAF), and Parts Depths Fields (PDF);
• a 3D pose encoder/decoder.
It is worthwhile to mention that PIF and PAF are the same as in [59], while PDF is new.
Furthermore, most existing multi-person 3D pose networks, e.g., [4, 42], take multiple
frames as the input and perform some sort of 2D-to-3D operations before producing the
final 3D poses. In contrast, our network can be trained in an end-to-end manner.
Figure 4.4: The proposed CNN for multi-person 3D pose estimation from a single
image.
Key Components
We will discuss the roles and implementations of PIF, PAF, PDF, and the 3D pose
encoder and decoder here.
PIF is used to describe where the 2D keypoints are so that the body parts can be
localized. Different joints are processed independently. As far as the implementation is
concerned, there are five components for each location of joint j in the output map. They
are c, p_x, p_y, b and \sigma, where c is the confidence score, (p_x, p_y) denotes the coordinates
of the point that is closest to joint j, b is used to characterize the adaptive regression
loss for (p_x, p_y), and \sigma stands for the standard deviation of the Gaussian component at
location (p_x, p_y) when we recover the high-resolution confidence map of joint j from the
low-resolution output map produced by the network.
Figure 4.5: The skeleton representation of human poses in our framework.
PAF is used to characterize the link location of a skeleton, where links are used to
connect joints detected by PIF. The joints and links are associated together to form an
instance of a skeleton. In our skeleton representation as illustrated in Fig. 4.5, some
PAFs correspond to real bones while others do not. They are just some virtual connec-
tions between joints, e.g. the connection between the left ear and the left shoulder. Since
both PIF and PAF are the same as those in [59], we refer to the original paper for further
details.
PDF is used to regress the relative depth of joints. We use the relative depth because
of the inherent scale ambiguity of 3D poses recovered from multiple 2D poses. Note that
the 3D poses used as the ground truth are expressed in a relative scale. Similar to
PIF, there are two PDF components for each location (i, j). They are d_{ij} and \sigma_{ij}. The
former represents the relative depth at location (i, j), while the latter denotes the radius
of the Gaussian component centered at the location of the closest joint, denoted by
(p_x^{ij}, p_y^{ij}). The actual value of the high-resolution depth map at location (x, y) satisfies
the following:
D_{xy} = \sum_{i,j} d_{ij} \, \mathcal{N}(x, y \mid (p_x^{ij}, p_y^{ij}), \sigma_{ij}).  (4.6)
The 3D pose encoder is used to encode the 3D poses into the three part fields, while the
3D pose decoder is used to decode the three part fields back into 3D poses. For the
2D-poses-plus-depth representation of 3D poses, we encode/decode the 2D pose part in the
same way as in [59].
Loss Function
Since the 3D pose encoder is not differentiable, we need to compute the loss between
the encoded ground-truth part fields and the predicted part fields. The overall loss function
is given below:
L = L_{PIF} + L_{PAF} + L_{PDF}.  (4.7)
Because our camera poses are estimated with an SfM-based method, scale ambiguity
is inevitable in our case. To handle the ambiguity, we use the following
relative depth loss in our network:
L_{PDF}(d_1, d_2) = \mathrm{Var}\!\left[ \log \frac{d_1}{d_2} \right]
                  = \frac{1}{N} \sum \left( \log \frac{d_1}{d_2} \right)^2
                  - \frac{1}{N^2} \left( \sum \log \frac{d_1}{d_2} \right)^2.  (4.8)
Therefore, our estimated keypoint depths only need to fit the relative scale of the depth
labels. That is, the estimated depths and the depth labels are considered the same if the
ratio between them is consistent at all locations.
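A numpy sketch of the relative depth loss in Eq. (4.8): the variance of the log-ratio between predicted depths and depth labels, which vanishes whenever the two differ only by a global scale (array names are illustrative).

import numpy as np

def relative_depth_loss(d_pred, d_label, eps=1e-8):
    # Eq. (4.8): Var[log(d1 / d2)] over all labeled keypoints.
    log_ratio = np.log((d_pred + eps) / (d_label + eps))
    n = log_ratio.size
    return (log_ratio ** 2).sum() / n - (log_ratio.sum() ** 2) / (n ** 2)

d_label = np.array([1.0, 2.0, 3.0, 4.0])
print(relative_depth_loss(2.5 * d_label, d_label))                             # ~0: scale is ignored
print(relative_depth_loss(d_label + np.array([0.0, 0.5, 0.0, 0.0]), d_label))  # > 0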
Length Scale Calibration
Although the predicted depths are relative, we can still make use of the prior information
hidden in human poses to roughly estimate the missing scale s. We need to
find s such that the squared difference between the scaled bone lengths s l_i and the average
bone lengths \bar{l}_i is minimized, as given below:
\text{minimize}_{s} \; \sum_{i} \frac{1}{2} \left( s\, l_i - \bar{l}_i \right)^2.  (4.9)
It is not difficult to show that Eq. (4.9) has the optimal solution at
s^{*} = \frac{\sum_{i} l_i \bar{l}_i}{\sum_{i} l_i^2}.  (4.10)
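Eq. (4.10) in a few lines of numpy, where bone_lengths are the bone lengths measured on the predicted (relative-scale) skeleton and mean_lengths are the prior average bone lengths (both names are illustrative):

import numpy as np

def calibrate_scale(bone_lengths, mean_lengths):
    # Closed-form least-squares scale s* of Eq. (4.10).
    l = np.asarray(bone_lengths, dtype=float)
    l_bar = np.asarray(mean_lengths, dtype=float)
    return (l * l_bar).sum() / (l ** 2).sum()

# A skeleton predicted at roughly one third of the prior scale gives s* close to 3.
print(calibrate_scale([0.10, 0.15, 0.09], [0.30, 0.45, 0.27]))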
Network Training
Our proposed CNN for multi-person 3D pose estimation from a single image is shown
in Fig. 4.4. It can be trained on data with incomplete labels, e.g., images with 2D
supervision only, or human pose labels with some missing joints. This training flexibility
comes from the pose encoder, which can generate independent maps for PIF, PAF and
PDF separately. If some fields are missing in the ground truth labels, we can simply
set the weights of these fields to zero while keeping the labels of other fields in the gradient
computation process. When training images have 3D supervision, we project the
3D skeleton information back to the view of each frame so that our network can learn
camera-independent multi-person 3D poses.
Before we train the proposed CNN with the Mannequin dataset, we first train it with
64,115 images in the 2017 COCO training set that have 2D pose annotations. In this
pre-training process, the PDF branch is not affected since the COCO dataset does not
have the depth information. The pre-training of 2D pose related fields can improve the
robustness of the ultimate 3D pose estimation network significantly.
In the network training with the Mannequin dataset, we apply roughly the same
data augmentation as in [59], including random cropping to 95-100% of
the short edge of the original image, random horizontal flipping, etc. We use the SGD
optimizer with a learning rate of 0.001 and a momentum of 0.95, with no weight decay
and a batch size of 8.
4.3 Experiments
We first conduct experiments to verify the validity of the proposed MVM method quan-
titatively with the Mannequin dataset. Then, we do performance benchmarking with
two public datasets; namely, 3DPW and MSCOCO. The purpose is to show how addi-
tional training based on the 3D supervision offered by the proposed MVM method as
well as the Mannequin dataset helps improve the performance of the multi-person 3D
pose network.
Matching Method     | E_R ↓ | Outliers ↓ | |G_k| ↑ | E_R w/o RANSAC ↓ | |G_k| w/o RANSAC ↑
Baseline            | 33.1  | 12.1       | 6.5     | N/A              | 18.7
Arnab et al. [4]    | 25.7  | 3.3        | 14.7    | 40.9             | 18.1
MVM                 | 14.6  | 8.5        | 41.2    | 28.8             | 49.7
*Dong et al. [22]   | 9.6   | 2.9        | 10.5    | 22.3             | 13.5
*MVM                | 10.4  | 5.1        | 10.8    | 25.1             | 15.9
Table 4.1: Performance comparison between several matching algorithms, where * means only evaluated on clips with fewer than 50 frames (otherwise, it would take too much time to run Dong's algorithm). A smaller reconstruction error E_R implies more consistent 3D poses. A larger number of corresponding 2D keypoints |G_k| as well as a smaller number of outliers means a more robust 3D reconstruction process and thus potentially better 3D poses.
4.3.1 MVM Evaluation with Mannequin Dataset
We will demonstrate the effectiveness of the proposed MVM method using the Mannequin chal-
lenge dataset in both qualitative and quantitative ways. For the qualitative performance of gen-
erated 3D poses, some predicted examples are shown in Fig. 4.6. In the first column of this figure, two people are sitting on the carpet with severely occluded poses. This example shows the capability of the proposed MVM method in recovering complicated poses. In some scenarios, a predicted human pose may have a low confidence score (e.g., the orange girl in the third column), or a small parallax angle may cause the 3D reconstruction to fail (e.g., some of the joints associated with the orange-and-pink person in the second column). The MVM method tends to filter out such instances.
For quantitative evaluations, since the Mannequin dataset does not have any labels, we use
the following metric to quantify the validity of generated 3D poses: the reprojection error E_R of
estimated 3D poses with their corresponding 2D poses. If the average reprojection error is low
enough, it is reasonable to believe that the quality of the generated 3D poses is good.
We compare the MVM method with three other methods in Table 4.1. They are:
1. a baseline that uses sequential matching;
2. a shortest path matching method [4];
3. the state-of-the-art optimization-based matching algorithm [22].
The triangulation is computed joint-wise. We also compare the effect of RANSAC in eliminating
outliers. The quantity |G_k| in the table means the number of 2D keypoints chosen for triangulation.
As shown in Table 4.1, the proposed MVM method achieves comparable performance with
the state-of-the-art optimization method [22] on short clips at a much lower computational cost.
These video clips are short enough for the optimization method to converge within a reasonable
amount of time (e.g. an hour). For longer video clips, our MVM method outperforms both the
baseline method and the shortest path matching algorithm by a large margin.
Matching Method                          | Ave. E_R | Ave. |G_k|
Shortest Path w/ MSE                     | 25.7     | 14.7
Shortest Path w/ Geo Dist                | 20.5     | 22.6
Shortest Path w/ Geo Dist + Appearance   | 19.6     | 19.8
MVM w/ MSE                               | N/A      | 15.8
MVM w/ Geo Dist                          | 16.8     | 45.5
MVM w/ Geo Dist + Appearance (ours)      | 14.6     | 41.2
Table 4.2: Comparison among different affinity matrices.
Furthermore, we show how different similarity metrics affect the quality of generated 3D
poses in Table 4.2. It is clear from the table that the proposed geometric distance contributes the largest performance gain in terms of the average reprojection error. As to the appearance similarity,
although it reduces the total number of matched 2D poses slightly, we can obtain the best overall
reprojection error by combining the geometric distance and the appearance similarity since their
integration introduces more constraints to matched poses.
4.3.2 3DPW Evaluation
Very few multi-person 3D human pose datasets are available to the public. One recently released dataset, MPII-3DPW [113], contains 60 outdoor video clips captured
Figure 4.6: Visualization of the Mannequin dataset results: (1st row) original images
with 2D poses, (2nd row) generated 3D poses. The first column indicates a successful
estimation of 3D poses in the scene. Due to the low confidence score (e.g., the orange
girl in the third column) or a small parallax angle which can cause the failure of 3D
reconstruction (e.g., some part of the joints associated with the orange-and-pink person
in the second column), the MVM method filters out such instances in a scene.
by a mobile phone with 17 IMUs attached to the subjects. The IMU data allow people to accu-
rately compute 3D poses and use them as the ground truth. The test set consists of 24 video clips.
We use the 14 keypoints that are common in both MSCOCO and SMPL skeletons. The same
setting was also used in [4]. We evaluate on those frames that have enough visible keypoints
for 3D pose estimation and ignore subjects that have fewer than seven visible 2D keypoints. We
compute the Procrustes Aligned Mean Per Joint Position Error (PA-MPJPE) [4] independently
for each pose and, then, average errors for each tracked person in each video clip. This process
implies that we count video clips with two people twice, etc. Finally, we average over the entire
dataset.
Table 4.3 shows how incorporating additional data from our Mannequin dataset improves results on the 3DPW test set. Training our proposed network with our data improves the PA-MPJPE by 3.5 mm. Since the 3DPW dataset has annotations for the full parametric body model, the SMPL model [72], better performance is expected if the network can leverage such information. Thus, we compare the performance of the HMR model trained on the 3DPW dataset with
Method                                      | single frame PA-MPJPE (mm) ↓
Train w/ our network                        | 82.3
Train + Mannequin dataset w/ our network    | 78.8
Popa et al. [90]                            | 108.2
SMPLify, Bogo et al. [7]                    | 108.1
Train w/ HMR [52]                           | 81.3
Train + Mannequin dataset w/ HMR [52]       | 78.2
Table 4.3: Evaluation of estimated 3D poses for the MPII-3DPW dataset. The upper part indicates the results of bottom-up based methods and the lower part indicates the results of top-down based methods.
the performance of HMR trained on 3DPW plus the Mannequin dataset. The PA-MPJPE was improved by 3.1 mm.
4.3.3 MSCOCO Evaluation
Besides facilitating 3D pose estimation, we are able to improve the accuracy of 2D pose estima-
tion by training a learning system using our Mannequin dataset. Table 4.4 provides quantitative
results for the COCO dataset. The evaluation metrics in COCO dataset are based on the mean
average precision (AP) over 10 object keypoint similarity (OKS) thresholds as the main com-
petition metric [69]. The OKS score measures the similarity between two 2D poses, playing essentially the same role as IoU does in object detection or segmentation. Our dataset improves the baseline performance by 1.9 in mAP. The training benefits from the additional data in our dataset, which contains many unusual poses. Apparently, people are more likely to make strange and challenging poses when shooting a video clip to be uploaded to YouTube. As a result, our dataset helps recover difficult cases.
Training Data               | mAP@OKS ↑ | AP@OKS=0.5 ↑
Train                       | 64.6      | 85.9
Train + Mannequin dataset   | 66.5      | 87.1
Table 4.4: Evaluation of estimated 2D poses in the COCO validation set.
These experiments show how one can effectively use the Mannequin dataset to improve
the per-frame 2D pose model in multiple datasets. Besides, in Table 4.5, we compute C_var to empirically compare the consistency of the initial 2D pose estimators. As defined in Equation 4.11, C_var is the variance of the pairwise triangulated 3D keypoints of the 2D pose predictions,

    C_{var}(\{x^c_u\}) = \frac{1}{M(M-1)/2} \sum \left(F_{tri}(x^c_u, x^c_v) - \bar{X}^c\right)^2,    (4.11)

where \bar{X}^c denotes the mean of the pairwise triangulated 3D keypoints and a threshold (0.5 or 0.9 in Table 4.5) is used to eliminate outliers. As shown in the table, OpenPifPaf produces a more consistent set of 2D poses, which is why we use OpenPifPaf as the default 2D pose estimator.
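The following NumPy sketch mirrors Eq. (4.11) for a single person; the pairwise triangulation routine is left as an assumed callable, and the threshold-based outlier removal reported in Table 4.5 is omitted for brevity:

import itertools
import numpy as np

def c_var(poses_2d, triangulate_pair):
    # poses_2d: list of M per-frame 2D keypoint arrays for one person.
    # triangulate_pair: assumed callable (x_u, x_v) -> array of 3D keypoints.
    points_3d = np.stack([
        triangulate_pair(x_u, x_v)
        for x_u, x_v in itertools.combinations(poses_2d, 2)
    ])                                 # shape: (M*(M-1)/2, num_joints, 3)
    mean_3d = points_3d.mean(axis=0)   # mean of the pairwise triangulations
    # Average squared deviation from the mean over all pairs and joints.
    return ((points_3d - mean_3d) ** 2).sum(axis=-1).mean()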
2D pose network   | C_var @ threshold = 0.5 | C_var @ threshold = 0.9
Mask-RCNN [36]    | 581.8                   | 233.6
OpenPifPaf [59]   | 351.0                   | 185.7
Table 4.5: Comparison between two 2D pose estimation networks, where we report the variance C_var of the pairwise triangulated 3D poses, which is used to measure the consistency of predicted 2D poses; a threshold is used to eliminate outliers.
4.4 Conclusion
An MVM method was proposed to generate reliable 3D supervisions from unlabeled action-
frozen video in the Mannequin dataset so as to improve the estimation accuracy of multi-person
2D and 3D poses. The MVM method attempts to match 2D human poses estimated across multi-
ple frames. The key to efficient matching lies in taking the geometric constraint (of static scenes
in the Mannequin dataset) and appearance similarity into account jointly. Afterwards, through
the triangulation of a group of matched 2D poses, optimization by considering the human pose
prior and re-projection errors jointly and bundle adjustment, we can obtain reliable 3D supervi-
sions. These 3D supervisions are used to train a multi-person 3D pose network. It was demon-
strated by experimental results that both 3D pose estimation accuracy (tested for the 3DPW
dataset) and 2D pose estimation accuracy (tested for the MSCOCO dataset) can be improved.
Chapter 5
Self-supervised Representation
Learning
5.1 Introduction
The text information in images such as optical character recognition (OCR) signals is intuitively
useful to tasks such as classification of document-related image categories, e.g., menus, business
cards, street names in maps, store/hotel/park names, etc. There have been a considerable number
of papers on text detection [68, 75, 70, 91, 116]; namely, extracting the text information from an
image accurately. After text detection, the immediate follow-up task is text recognition, which
is the task of an OCR engine. Yet, there is less work on leveraging extracted text data for image
understanding. In this work, we aim at exploiting the extracted text information to improve the
accuracy of two image understanding tasks: 1) image tag prediction and 2) functional image
classification.
For the first task, learning to predict user tags of an image is an extremely challenging task
because of the great diversity of images associated with the same tag. However, with the assis-
tance of textual labels extracted from images, the task can become easier, as illustrated in Figure
5.1 (first row). This is particularly true if the semantics provided by textual labels is well correlated
with the downstream task, say, the categorization of texty images, where image categories and
text information are somehow related.
For the second task, functional labels stand for entities associated with printings that possess
certain functionalities, e.g. menus, posters, newspapers, receipts, passports, notes, bank cards,
Figure 5.1: Example images of the two tasks: image tag prediction (first row) and func-
tional image classification (second row). Underlined tags in the first row can be also
found in OCR results.
personal IDs, etc. Example images are shown in Figure 5.1 (second row). Functional image clas-
sification has real-world applications such as intelligent business, augmented reality, unmanned
grocery store, etc. Inspired by recent success of visual representation learning in many general-
purpose computer vision applications, including image classification, object detection and even
semantic segmentation, it is advantageous to train a representation network targeting images
with the textual information (e.g., OCR signals).
The rationale of using text data as supervision is that images with similar texts tend to be
semantically related; namely, they have common visual contents. Thus, we can train a represen-
tation model that learns image contents to capture information like “what images are likely to
have word ‘menu’ on it”, as illustrated in Figure 5.2. The representation network can be trained
on a large number of images that contain textual labels with an objective of capturing the visual
similarity among images of similar keywords and improving their representation capability. The
information distilled from such a representation can be potentially useful for the classification of
texty images.
Our current work is set at a distance from many existing representation learning models.
The focus of our work is different in two main aspects. First, we aim at exploring the benefit
of a special kind of visual information: image OCR results, which is good for certain kinds of
images but is also naturally irrelevant to many other kinds of images. As a result, our goal is
to build a domain-specific classifier instead of a general purpose model. Our proposed method
is evaluated in two tasks, image tag prediction and functional image classification. Specifically,
the former depends on noisy user generated information but is of larger scale, while the latter
has reliable groundtruth verified by human annotators but is of smaller scale. Second, our model is based on pre-computed visual features and does not apply any finetuning of deep neural networks. This setting significantly relieves the requirement of computing resources during the training stage of a model. We believe that such a setting could see more applications in the future because of its simplicity and efficiency.
Our method, named SSRIT (self-supervised representation learning of images with texts),
exploits textual signals in a self-supervised manner. SSRIT takes pre-computed visual signals as
the input so that it can be trained without any GPU requirement. It is shown by experiments that
SSRIT contributes to functional image classification and image tag prediction. With experiments
on YFCC100M [109] (image tag prediction) and FunctionalFlickr (functional image classifica-
tion), we observe that the accuracy of a baseline model can be significantly improved by SSRIT and that SSRIT outperforms state-of-the-art baseline methods in both tasks.
There are three main contributions of this study as elaborated below.
• We propose a simple yet effective representation learning framework, called SSRIT,
whose training benefits from the use of textual data that are associated with the training
images naturally.
• We obtain consistent performance improvement with the SSRIT model regardless of whether
the textual signals are present in test images or not.
• Our proposed model is built on top of pre-computed visual signals, which can significantly
reduce the computation resources needed during the training stage of the model.
Figure 5.2: Our framework is based on the assumption that images that share the same
detected textual keywords (e.g. images with word “menu”) are semantically similar.
Thus, we can train a representation model that learns image contents to capture infor-
mation like “what images are likely to have word ‘menu’ on it” and use such information
to improve downstream models.
5.2 Methodology
Similar to other representation learning methods, our SSRIT method consists of two stages:
upstream training and downstream evaluation. The representation network is first trained using
a large-scale upstream dataset. Each image is associated with pre-computed visual and text
signals. Under the assumption that the visual information is well-captured by a baseline model,
SSRIT does not touch raw image bytes any longer. In the following, we first introduce the
baseline downstream model in Sec. 5.2.1. Then, we propose the SSR training in the upstream
pipeline and show how to incorporate SSR in downstream task evaluation in Sec. 5.2.2.
5.2.1 Input Signals and Downstream Baseline
The input to SSRIT may have two sources: visual and textual signals.
Visual Input. Visual signals can be the last-layer responses of a pre-trained CNN (e.g.
“res5c” for ResNet-50 [93]) or features generated by a visual model. We use pre-computed
feature responses of the input image for the following three reasons.
1. The requirement of computing resources can be significantly relieved. Training of the
whole visual feature network on a large dataset is extremely time-consuming. This is
usually impractical on a regular personal computer (e.g. finetuning a ResNet-50 on 100M
images).
2. The complexity of the system can be significantly simplified. A simple system that works
reasonably well could be a good fit for a domain-specific classification task.
3. The design of the system architecture can be more flexible since it can accommodate any
type of visual features.
Algorithm 5: Pseudo-code for textual label extraction.
Input : A list of words RW representing the raw textual output of an image;
        K regular expressions KW representing K keywords.
Output: Extracted textual label O.
for k = 1 ... K do
    if KW_k matches at least one word in RW then
        O_k <- 1;
    else
        O_k <- 0;
    end
end
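A minimal Python sketch of Algorithm 5 above is given next; the function name and the example keyword patterns are our own illustrations, not part of the original system:

import re

def extract_textual_labels(raw_words, keyword_patterns):
    # Map raw OCR words to a K-dimensional binary label vector O:
    # O[k] = 1 iff the k-th keyword pattern matches at least one OCR word.
    O = [0] * len(keyword_patterns)
    for k, pattern in enumerate(keyword_patterns):
        if any(re.search(pattern, w) for w in raw_words):
            O[k] = 1
    return O

# Example with hypothetical keyword expressions:
print(extract_textual_labels(["appetizer", "beef", "soup"],
                             [r"menu|appetizer", r"passport"]))  # -> [1, 0]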
Textual Input. Considering the fact that different OCR engines may have different textual representations, the textual input is kept in the form of a list of words, which makes the textual input consistent across a wide range of application domains. We then apply a filter to the raw textual responses so as to obtain a subset of keywords that are likely to be related to the target label set, i.e., text-related entities. The three steps are detailed below.
1. Generate raw textual output from an OCR engine
The raw textual output is represented as a list of extracted words.
2. Extract textual labels from raw textual output
Extracting textual labels can be formulated as producing a K-dimensional vector O, where O_i is a binary variable indicating whether the i-th keyword is detected in the raw textual output. The process of extracting OCR labels from the raw textual output is described in Algorithm 5.
Algorithm 6: Pseudo-code for selecting textual keywords.
Input     : All raw textual outputs of the training set RW(n), n = 1, ..., N;
            word similarity function syn(word_1, word_2) in [0, 1];
            target words T_i (i = 1 ... C).
Parameter : Word similarity threshold.
Output    : Selected keyword expressions KW_i, i = 1 ... K.
Compute the word count F(.) for all textual words in RW(n);
Sort words based on F(.) in decreasing order into D;
KW <- {};
for d in D do
    syn(d, T) <- max_{i=1...C} syn(d, T_i);
    if syn(d, T) > threshold then
        KW <- KW ∪ {d};
    end
    if |KW| = K then
        break;
    end
end
3. Select keywords from textual labels
The selection of K keywords from the textual labels is based on two factors: 1) semantic connection to the downstream classification targets, and 2) frequency in the training set. For example, if we want to detect “menu” images, words such as “appetizer”, “beef” or even the word “menu” itself are considered to be relevant, while words like “cat” or “art” are not. The procedure is described in Algorithm 6 and sketched in code below. Note that this approach is limited to the languages that appear in the training data.
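A minimal Python sketch of Algorithm 6 follows; the word-similarity function syn is assumed to be supplied externally (e.g., cosine similarity of word embeddings), and the names are our own:

from collections import Counter

def select_keywords(raw_word_lists, syn, targets, K, threshold):
    # Pick the K most frequent OCR words whose similarity to any target word
    # exceeds the given threshold.
    freq = Counter(w for words in raw_word_lists for w in words)
    keywords = []
    for word, _ in freq.most_common():           # decreasing frequency
        if max(syn(word, t) for t in targets) > threshold:
            keywords.append(word)
        if len(keywords) == K:
            break
    return keywords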
Downstream Baseline. The downstream baseline for image classification with raw textual
input is shown in Fig. 5.3. The input contains both image and textual data, which are fed into a simple classifier, a shallow neural network consisting of two cascaded fully-connected layers (essentially a two-layer MLP), to predict the class of the input image.
Figure 5.3: The downstream baseline for image classification with both visual and tex-
tual inputs.
5.2.2 Self-Supervised Representation (SSR)
Upstream Training. The pipeline of SSR upstream training is shown in Fig. 5.4, where the
green box denotes the self-supervised representation (SSR) network and its output is the desired
SSR. The SSR network is a shallow neural network whose architecture is basically the same
as the linear-head downstream model while its objective is replaced by textual signals extracted
from the same image. Its goal is to relate the hidden visual information of the visual input with its
textual input in the form of keywords. We denote the training set by D = {(x_n, y_n)}, n = 1, ..., N, with D-dimensional visual input x ∈ R^D and the associated textual label y ∈ {0, 1}^K, where K is the number of keywords and the value 1 or 0 means that the k-th keyword exists or not.
We use the function f(x; θ) ∈ R^E to indicate the response of the intermediate layer as an embedding of input x computed by a convolutional network with parameters θ. The predicted probability of ŷ ∈ {0, 1}^K given input x is defined as sign(W^T f(x; θ)), where W is an E×K matrix. The parameters θ and W are jointly optimized to minimize a multi-class logistic loss, which is the negative sum of the log-probabilities over all positive labels. When the numbers of images in
Figure 5.4: Self-supervised representation training with textual input.
different classes are highly unbalanced, we need to introduce an adaptive class weight to address this problem. Finally, the probabilities are computed using a softmax layer:

    \ell(\theta, W; \mathcal{D}) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y_{nk}\,\log\left[\frac{\exp\left(w_k^{\top} f(x_n;\theta)\right)}{\sum_{j=1}^{K}\exp\left(w_j^{\top} f(x_n;\theta)\right)}\right].    (5.1)
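A minimal PyTorch sketch of the objective in Eq. (5.1) is given below; the adaptive class weighting mentioned above is omitted, and all names are our own:

import torch
import torch.nn.functional as F

def ssr_loss(embeddings, W, y):
    # embeddings: (N, E) features f(x; theta); W: (E, K) classifier weights;
    # y: (N, K) binary keyword labels as floats. Returns the negative sum of
    # log-softmax probabilities over positive labels, averaged over the batch.
    log_probs = F.log_softmax(embeddings @ W, dim=1)   # (N, K)
    return -(y * log_probs).sum(dim=1).mean()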
Downstream Training and Evaluation. The training and evaluation of an improved down-
stream pipeline using SSR features learned from textual signals is shown in Fig. 5.5, where the
gray box denotes the improved downstream classifier. The input to the gray box includes visual
features, textual features (or called sparse OCR features), and SSR features. The classifier can
have the same architecture as the one used in the baseline downstream model with a large input
dimension. It can be trained using the same output label. The difference between Figs. 5.3 and
5.5 is that the latter has SSR features as the additional input.
5.3 Experiments
Experimental results are given in this section. We first discuss the experimental design in prepar-
ing suitable datasets in Sec. 5.3.1. Then, we assess the effectiveness of the trained SSR features
Figure 5.5: Training and evaluation of an improved downstream pipeline using SSR
features as additional input.
by performing experiments in two tasks: image tag prediction and functional image classification.
The experimental details are presented in Secs. 5.3.2 and 5.3.3, respectively. Afterwards, we pro-
vide some ablation study results in Sec. 5.3.4. The limitations of SSRIT are further discussed in Sec. 5.3.5.
5.3.1 Data Preparation
We build a special subset of tags that are relevant to our objective and evaluate different methods
for this subset. This setting is useful in understanding the type of images that can benefit from
the textual information. Unlike a general purpose classification model that needs to learn a wide
range of knowledge from training data, the objective of our SSRIT method is to make the best
use of the textual information. Admittedly, SSRIT may not provide a performance gain for a
general-purpose image classification task (e.g., object classification for the Imagenet). For data
preparation, we use the co-occurrence count in selecting relevant images. That is, we count the
number of images that have the exact text description of tags detected in the OCR output and
select top C tags as our target subset in the experiment. More details are given in Algorithm 7.
Figure 5.6: Examples of images with high OCR scores in three rows: calendar, menu
and passport. The boxed images in red indicate misleading images that contain a detected OCR keyword but are not semantically related.
5.3.2 Image Tag Prediction
Image tags are used to describe the content of images. The quality of SSR features is measured
by comparing the accuracy of predicted image tags in the YFCC100M dataset with several other
baseline methods in Table 5.1. By following [51], we use precision at k, denoted by P@k, to
measure the accuracy of predicted tags. We have the following observations from this table.
1. SSR features can capture the most “useful” information in the original visual signals.
As a result, combining SSR and the original visual signals may increase the redundancy
and thus causes worse performance than the model using SSR only. As illustrated in
Figure 5.8, the orange bar (SSR only) is slightly better than the green bar (combining SSR
and visual signals). This comparison suggests that our SSR can produce a meaningful
Algorithm 7: Pseudo-code for selecting the target subset.
Input : OCR responses of the dataset RW(n), n = 1, ..., N;
        tags of the dataset Tag(n), n = 1, ..., N;
        number of target tags C.
Output: Target tags T_i (i = 1 ... C).
Initialize the co-occurrence count array cnt to all zeros;
for i = 1 ... N do
    for w in Tag(i) do
        if w in RW(i) then
            cnt(w) <- cnt(w) + 1;
        end
    end
end
Sort cnt in descending order;
Output the words w with the top C co-occurrence counts cnt(w).
representation from the original visual signals that can help improve the performance in
the OCR-related tag prediction task.
2. SSR features cannot fully cover the information in OCR signals; still, SSR can provide some of the semantic information carried by the direct OCR signals. As illustrated in Figure 5.7, the model that additionally uses OCR signals is slightly better than the model that relies on SSR only (the blue bar is higher than the orange bar), and we can get even better performance when combining SSR with OCR signals.
3. BiT achieves consistently better performance than R50 in this image tag prediction task. The overall best performance (P@1 of 60.1 and P@10 of 88.4) is obtained by “T+S” trained on BiT signals.
5.3.3 Functional Image Classification
The functional image classification task is evaluated in FunctionalFlickr dataset, which consists
of around 300k functional images labeled by human annotators. Each image is associated with
one or more labels that represent different types of “functional” images. “Functional labels”
Input Signals                 | P@1 (R50) | P@1 (BiT) | P@10 (R50) | P@10 (BiT)
V                             | 23.6      | 25.8      | 54.7       | 58.3
T+V                           | 45.9      | 48.4      | 79.3       | 82.6
S                             | 43.1      | 44.6      | 75.2       | 80.8
S+V                           | 42.4      | 43.7      | 74.2       | 79.1
T+S                           | 58.8      | 60.1      | 84.1       | 88.4
T+S+V                         | 57.5      | 58.3      | 83.6       | 87.5
Joulin et al. (Pre-trained)   | 16.6      |           | 35.9       |
Table 5.1: Comparison of image tag prediction results, where “V”, “T” and “S” stand for visual signals, OCR signals and SSR signals obtained by SSRIT, respectively. Two kinds of pre-computed visual signals are tested: ResNet-50 (R50) and BiT-M (BiT). The best and second-best results are shown in bold and with an underline, respectively.
Figure 5.7: Comparison of the image tag prediction performance between models with different usage of textual signals, where “V”, “T” and “S” stand for visual signals, textual signals and SSR signals, respectively.
basically stand for 65 common categories of document-like images, e.g. menu, poster, passport,
etc. In this task, we use area-under-precision-recall-curve (auPR) as the target metric.
The results are shown in Table 5.2, from which we have the following observations:
1. The overall relationship between different combinations of signals is roughly the same
as the experiments in image tag prediction tasks as discussed in 5.3.2: (1) “S” covers
Figure 5.8: Comparison of the image tag prediction performance between models without direct usage of textual signals. We may assume that our trained SSR signals capture most of the “useful” information in the original visual signals from the observation that the orange bar (SSR signals only) is better than the green bar (combining SSR signals and visual signals).
most semantic information from “V”, (2) “T” offers the biggest performance gain and (3)
“T+S” achieves the overall best performance.
2. Our model can robustly work well in various settings under different evaluation metrics.
Input Signals | auPR ↑
V             | .251
T+V           | .543
S             | .461
S+V           | .469
T+S           | .556
T+S+V         | .541
Table 5.2: Functional image classification results. “V”, “T” and “S” stand for visual signals, textual signals and SSR signals, respectively. BiT is used as the visual signals. The best and second-best results are shown in bold and with an underline, respectively.
We evaluate the performance of our proposed SSR on human-verified benchmarks. The previous evaluation on the image tag prediction task may not be strong enough to support our claim on the benefit of the proposed model because we do not have the actual groundtruth for that task. The evaluation on the functional image classification task is conducted on a large-scale dataset with reliable groundtruth, and the domain of this task highly matches our model with OCR information.
5.3.4 Implementation Choices
Choice of Representation Network. To show how different choices of the representation net-
work architectures affect the performance, we present experimental results in Table 5.3. As
shown in this table, doubling network size from 128 to 256 can lead to a significant perfor-
mance gain, while little improvement can be obtained from 256 to 512. In order to reduce the
computation cost, we believe that 256 is a reasonable choice for the network size.
Choice of SSR Extraction. We compare two different choices for extracting the trained SSR features. As shown in Table 5.4, it is clear that the response of the penultimate (second-to-last) layer of the trained representation network gives the best performance.
Representation Network Architecture | auPR ↑
128                                 | .460
256                                 | .486
512                                 | .487
256x256                             | .482
Table 5.3: FunctionalFlickr results on different representation network architectures.
Feature Extraction | auPR ↑
last               | .412
penultimate        | .461
Table 5.4: FunctionalFlickr results on different extraction choices of the representation features.
5.3.5 Limitations
We present two potential natural limitations of our method, which we believe are also limitations
of OCR information itself.
1. As a rather special kind of visual information, OCR signals can help understand some visual entities. However, admittedly, they are not helpful for most general-purpose classification models that need to learn a wide range of knowledge from training data. We also perform benchmarking on the ImageNet-1k classification task (ILSVRC14 split [16]) but do not find any improvement when applying SSR signals to the downstream network. The results are shown in Table 5.5: the performance drops after introducing our proposed SSR features.
2. OCR information may be misleading in some cases. This is because the OCR semantics may not be perfectly aligned with the downstream classification semantics. As shown in Figure 5.6, images with the same detected OCR keywords, e.g., the boxed images with the words “menu” and “passport”, may be misleading.
Input Signals        | top-1 acc ↑
BiT (frozen)         | 75.6
BiT (frozen) + SSR   | 74.8 (-0.8)
Table 5.5: ImageNet-1k classification performance.
5.4 Conclusion
A representation learning framework, SSRIT, is proposed in this work. SSRIT exploits OCR
signals in a self-supervised manner. Our proposed model can learn the implicit visual pattern
of images with similar detected keywords and thus help to improve the downstream classifiers.
Through the experiments on the YFCC100M (image tag prediction) and FunctionalFlickr (functional image classification) datasets, we show that it is beneficial to introduce textual information to both tasks.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
In this dissertation, we focus on three problems that are related to the topic of reducing the label-
ing cost: active learning, weakly supervised human pose estimation and representation learning
for images with texts.
In the active learning part, we proposed two methods: k-covers and TBAL.
In k-covers, we proposed to address the batch redundancy problem for uncertainty-based AL strategies. The proposed k-covers can enhance the similarity within a cluster by minimizing the maximum difference. Our empirical results on the MNIST and CIFAR-10 datasets suggest that our proposed method can outperform the state-of-the-art by a large margin when the initialization is class unbalanced. When the initialization is class balanced, our method can still slightly outperform the state-of-the-art.
In TBAL, we proposed to address the drawbacks of k-covers, namely its lack of use of unlabeled data and its potential defect when the batch size is small. Extensive experiments
show that our method can outperform state-of-the-art active learning methods in various set-
tings including different network architectures, different target classes as well as different loss
functions. In the first stage of TBAL, a consistency loss is introduced so that our model can
learn from both labeled and unlabeled data. In the second stage, our query strategy integrates
MCVarr and the new CRB technique so as to obtain a batch of informative and diverse sam-
ples. Through extensive experiments, we demonstrated the following two main advantages of
the proposed TBAL method. First, it selects more informative samples in both semi-supervised
and supervised settings for better accuracy. Second, it provides consistent improvement over a
number of methods that ignore batch redundancy. In particular, it has a large performance gain
for CUB-200-2011, which is a fine-grained image classification dataset.
In the weakly supervised human pose estimation problem, MVM method was proposed to
generate reliable 3D supervisions from unlabeled action-frozen video in the Mannequin dataset
so as to improve the estimation accuracy of multi-person 2D and 3D poses. The MVM method
attempts to match 2D human poses estimated across multiple frames. The key to efficient match-
ing lies in taking the geometric constraint (of static scenes in the Mannequin dataset) and appear-
ance similarity into account jointly. Afterwards, through the triangulation of a group of matched
2D poses, optimization by considering the human pose prior and re-projection errors jointly and
bundle adjustment, we can obtain reliable 3D supervisions. These 3D supervisions are used to
train a multi-person 3D pose network. It was demonstrated by experimental results that both 3D
pose estimation accuracy (tested for the 3DPW dataset) and 2D pose estimation accuracy (tested
for the MSCOCO dataset) can be improved.
In the problem of representation learning on images with texts, a representation learning
framework called SSRIT was proposed. SSRIT exploits OCR signals in a self-supervised man-
ner. Our proposed model can learn the implicit visual pattern of images with similar detected
keywords and thus help to improve the downstream classifiers. Through the experiments on the YFCC100M (image tag prediction) and FunctionalFlickr (functional image classification) datasets, we show that it is beneficial to introduce textual information to both tasks.
6.2 Future Research Directions
The general research idea about reducing labeling cost can lead to two potential directions: more
informative samples selected to get labeled and more information extracted from unlabeled data
with no cost. In these two directions, we bring up the following research problems:
• Quadratic programming based active learning strategy.
• Weak 3D Information Fusion on Frozen Videos.
• General Purpose Text Supervised Representation Learning.
6.2.1 Quadratic Programming Based Active Learning
We would like to explore a theoretically sound active learning strategy that combines uncertainty and diversity, and to analyze the batch redundancy problem in a quantitative way. One potential direction is to make use of integer quadratic programming models.
The goal of active learning is to choose M images from N unlabeled images, while we have L labeled images. Specifically, we want to maximize both the diversity and the uncertainty in the chosen batch. So, we can write our objective function as:
    \underset{c\in\{0,1\}^N,\; \sum_i c_i = M}{\text{minimize}} \quad \frac{c^{T}Ac}{M^{2}} + \frac{w_l^{T}c}{M} + \frac{w_{uc}^{T}c}{M},    (6.1)

where c is a zero-one variable indicating the choice of the batch, the first term is the diversity loss within the batch, the second term is the distance loss between the batch and the current labeled set, and the third term is the uncertainty loss within the batch.
The objective function in Equation 6.1 falls into the category of mixed-integer quadratic programming (MIQP). Objective functions of this form have a long history and can be used to express many portfolio optimization problems, as first shown by Markowitz [78]. Moreover, such objective functions can be efficiently solved by the cutting plane method [53]. Then, with the help of Gurobi [34], a commonly used optimization package, we can efficiently solve our objective.
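The following NumPy sketch, with names of our own choosing, simply evaluates the objective of Eq. (6.1) for one candidate batch; an exact MIQP solver such as Gurobi would instead search over all feasible indicator vectors c:

import numpy as np

def batch_objective(c, A, w_l, w_uc, M):
    # c: binary indicator vector with c.sum() == M.
    # A: pairwise diversity losses among unlabeled samples.
    # w_l: per-sample distance loss to the current labeled set.
    # w_uc: per-sample uncertainty loss.
    return c @ A @ c / M**2 + w_l @ c / M + w_uc @ c / M

# Toy usage: score one candidate batch of M = 2 samples out of N = 4.
N, M = 4, 2
rng = np.random.default_rng(0)
c = np.array([1, 0, 1, 0])
print(batch_objective(c, rng.random((N, N)), rng.random(N), rng.random(N), M))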
Cutting plane method [53]. The core idea of this method is to reduce a (mixed-)integer quadratic programming (MIQP) problem to a (mixed-)integer linear programming (MILP) problem. There are two main steps in the reduction:
1. Reduce the quadratic objective to a “linear objective with quadratic constraints” by introducing a slack variable z bounding c^T A c and then minimizing over z.
2. Approximate the quadratic constraints with a group of linear constraints using a first-order Taylor approximation.
Figure 6.1: Solving the MIQP with the cutting plane method: (a) convergence of the slack variable z; (b) convergence of the objective function. It can converge within 10 iterations.
After such a reduction, we can use an integer linear programming solver to solve the problem iteratively until the reduced linear constraints converge to the original quadratic form. An example of running the cutting plane method on our objective function is shown in Figure 6.1. In practice, the cutting plane method usually converges very fast.
Some preliminary results on MNIST are shown in Figure 6.2. We tried three settings of the weight on the uncertainty term, setting it to 1, 3 and 10. Since this weight controls the contribution of the uncertainty term, we found that if it is too small, the performance drops a lot, while if it is too large, our objective degenerates to the max-entropy strategy. An in-between choice can slightly outperform the baseline entropy method.
Figure 6.2: Active learning curve of our IQP formulation.
There is still a lot of work to be done in this direction. For example, the naive form of quadratic programming does not scale well when the unlabeled pool is very large because at least O(n^2) memory is needed, which is already too heavy a burden. Is it possible to have a reasonable approximation method to handle the large-scale case? Further and deeper investigation in this direction is needed.
6.2.2 Weak 3D Information Fusion on Frozen Videos
3D labels are usually very expensive to acquire. Thus, making improvement in better utilization
of the hidden 3D information in a scene can potentially reduce the labeling cost for applications
that need 3D labels. In our previous work about weakly supervised human pose estimation
problem, we explore the a large-scale dataset of action-frozen people videos and proposed a
relatively straightforward way of using such data under the strong assumption of roughly static
scene. We believe that it is possible to estimate some useful 3D information from it. For example,
we can try to extract information such as 3D surfaces, textures, etc. In the static scene setting, it
is feasible to obtain reliable estimates of various scene information with no manual labels.
• Utilizing weak 3D supervisions obtained from the Mannequin dataset. Although the 3D
poses generated by MVM method are accurate, we may have an incomplete number of
joints. This is caused by insufficient parallax (viewing angle). To give an example, the
Figure 6.3: Rich information provided by the 3D reconstruction software COLMAP. Currently, only the estimated camera poses are utilized, which is potentially a waste of resources.
right ear might be missing since it is not seen in input frames. One might “inpaint”
incomplete pose supervisions to obtain a pseudo groundtruth.
• Utilizing 3D point cloud information when running COLMAP. COLMAP can be used
to estimate camera poses and the 3D point cloud information. Currently, we only use
the camera pose information. The 3D point cloud set may provide some extra useful
information for better 3D pose estimation.
6.2.3 General Purpose Text Supervised Representation Learning
Images are sometimes associated with accompanying texts (not necessarily limited to the text
in the image) that can potentially help the understanding of image content. For example, the
caption of an image could be a strong hint of what this image is about. Depending on the source
of collected images, the format of associated texts may vary a lot. The diversity of textual
information makes it an extremely challenging problem but also sparks the potentials of learning
a strong general-purpose representation. Even though we can essentially collect an infinite number
of images from the internet with some texts in the same web page, it is very hard to utilize the
raw textual information associated with images.
Figure 6.4: Captioned images from MSCOCO [112] (left) and TextVQA [104] (right).
A potential further research direction is to figure out a way to efficiently characterize the
“semantics” of text and use it to train a general-purpose representation network. There are at least three challenges in this direction: 1) the collection of a large-scale dataset of images with captioning texts, 2) the identification of related texts and their characterization method (e.g., embeddings generated by BERT [18]), and 3) the design of a representation network that can be supervised by textual embeddings of some form.
Bibliography
[1] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele.
Posetrack: A benchmark for human pose estimation and tracking. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
[2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: shape
completion and animation of people. In ACM SIGGRAPH 2005, pages 408–416. 2005.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. Vqa:
Visual question answering. In Proceedings of the IEEE international conference on com-
puter vision, pages 2425–2433, 2015.
[4] A. Arnab, C. Doersch, and A. Zisserman. Exploiting temporal context for 3d human pose
estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 3395–3404, 2019.
[5] J. Azimi, A. Fern, X. Z. Fern, G. Borradaile, and B. Heeringa. Batch active learning via
coordinated matching. In Proceedings of the 29th International Coference on Interna-
tional Conference on Machine Learning, pages 307–314, 2012.
[6] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler. The power of ensembles
for active learning in image classification. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 9368–9377, 2018.
[7] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl:
Automatic estimation of 3d human pose and shape from a single image. In European
Conference on Computer Vision, pages 561–578. Springer, 2016.
[8] L. Bridgeman, M. Volino, J.-Y. Guillemaut, and A. Hilton. Multi-person 3d pose estima-
tion and tracking in sports. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition Workshops, pages 0–0, 2019.
[9] K. Brinker. Incorporating diversity in active learning with support vector machines. In
Proceedings of the 20th international conference on machine learning (ICML-03), pages
59–66, 2003.
[10] Z. Cao, T. Simon, S.-E. Wei, and Y . Sheikh. Realtime multi-person 2d pose estimation
using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 7291–7299, 2017.
[11] S. Chakraborty, V . Balasubramanian, and S. Panchanathan. Adaptive batch mode active
learning. IEEE transactions on neural networks and learning systems, 26(8):1747–1760,
2014.
[12] R. Chattopadhyay, Z. Wang, W. Fan, I. Davidson, S. Panchanathan, and J. Ye. Batch mode
active sampling based on marginal probability distribution matching. ACM Transactions
on Knowledge Discovery from Data (TKDD), 7(3):1–25, 2013.
[13] X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In Pro-
ceedings of the IEEE International Conference on Computer Vision, pages 1431–1439,
2015.
[14] F. Chollet. keras. https://github.com/keras-team/keras, 2018.
[15] H. Ci, C. Wang, X. Ma, and Y . Wang. Optimizing network structure for 3d human pose
estimation. In Proceedings of the IEEE International Conference on Computer Vision,
pages 2262–2271, 2019.
[16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale
hierarchical image database. In 2009 IEEE conference on computer vision and pattern
recognition, pages 248–255. Ieee, 2009.
[17] E. Denton, J. Weston, M. Paluri, L. Bourdev, and R. Fergus. User conditional hashtag
prediction for images. In Proceedings of the 21th ACM SIGKDD international conference
on knowledge discovery and data mining, pages 1731–1740, 2015.
[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[19] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-
supervised visual concept learning. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3270–3277, 2014.
[20] A. Doering, U. Iqbal, and J. Gall. Joint flow: Temporal flow fields for multi person
tracking. arXiv preprint arXiv:1805.04596, 2018.
[21] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan,
K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recog-
nition and description. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2625–2634, 2015.
[22] J. Dong, W. Jiang, Q. Huang, H. Bao, and X. Zhou. Fast and robust multi-person 3d pose
estimation from multiple views. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 7792–7801, 2019.
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner,
M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words:
Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[24] E. Elhamifar, G. Sapiro, A. Yang, and S. Shankar Sasrty. A convex optimization frame-
work for active learning. In Proceedings of the IEEE International Conference on Com-
puter Vision, pages 209–216, 2013.
[25] J. Fan, Y . Shen, N. Zhou, and Y . Gao. Harvesting large-scale weakly-tagged image
databases from the web. In 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 802–809. IEEE, 2010.
[26] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from
google’s image search. In Tenth IEEE International Conference on Computer Vision
(ICCV’05) Volume 1, volume 2, pages 1816–1823. IEEE, 2005.
[27] K. Fischer, B. Gärtner, and M. Kutz. Fast smallest-enclosing-ball computation in high
dimensions. In European Symposium on Algorithms, pages 630–641. Springer, 2003.
[28] Y . Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by
committee algorithm. Machine learning, 28(2-3):133–168, 1997.
[29] Y . Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. In
International Conference on Machine Learning, pages 1183–1192, 2017.
[30] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient
pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 350–359, 2018.
[31] Y . Guo. Active instance sampling via matrix partition. In Advances in Neural Information
Processing Systems, pages 802–810, 2010.
[32] Y . Guo, Y . Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual
understanding: A review. Neurocomputing, 187:27–48, 2016.
[33] Y . Guo and D. Schuurmans. Discriminative batch mode active learning. In Advances in
neural information processing systems, pages 593–600, 2008.
[34] L. Gurobi Optimization. Gurobi optimizer reference manual, 2018.
[35] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge
university press, 2003.
[36] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages 2961–2969, 2017.
[37] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification. In Proceedings of the IEEE international
conference on computer vision, pages 1026–1034, 2015.
[38] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778, 2016.
[39] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In
European conference on computer vision, pages 630–645. Springer, 2016.
[40] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application
to medical image classification. In Proceedings of the 23rd international conference on
Machine learning, pages 417–424, 2006.
[41] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Semisupervised svm batch mode active learning
with applications to image retrieval. ACM Transactions on Information Systems (TOIS),
27(3):1–29, 2009.
[42] F. Huang, A. Zeng, M. Liu, Q. Lai, and Q. Xu. Deepfuse: An imu-aware network for real-
time 3d human pose estimation from multi-view image. arXiv preprint arXiv:1912.04071,
2019.
[43] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convo-
lutional networks. In CVPR, volume 1, page 3, 2017.
[44] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and
B. Schiele. Arttrack: Articulated multi-person tracking in the wild. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 6457–6465, 2017.
[45] U. Iqbal, A. Milan, and J. Gall. Posetrack: Joint multi-person pose estimation and track-
ing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 2011–2020, 2017.
[46] H. Izadinia, B. C. Russell, A. Farhadi, M. D. Hoffman, and A. Hertzmann. Deep classi-
fiers from image tags in the wild. In Proceedings of the 2015 Workshop on Community-
Organized Multimodal Mining: Opportunities for Novel Solutions, pages 13–18, 2015.
[47] C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. V . Le, Y . Sung, Z. Li, and
T. Duerig. Scaling up visual and vision-language representation learning with noisy text
supervision. arXiv preprint arXiv:2102.05918, 2021.
[48] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. Godisart, B. Nabbe,
I. Matthews, et al. Panoptic studio: A massively multiview system for social interaction
capture. IEEE transactions on pattern analysis and machine intelligence, 41(1):190–204,
2017.
[49] A. J. Joshi, F. Porikli, and N. Papanikolopoulos. Multi-class active learning for image
classification. 2009.
[50] A. J. Joshiy, F. Porikli, and N. Papanikolopoulos. Multi-class batch-mode active learning
for image classification. In Robotics and Automation (ICRA), 2010 IEEE International
Conference on, pages 1873–1878. IEEE, 2010.
[51] A. Joulin, L. Van Der Maaten, A. Jabri, and N. Vasilache. Learning visual features from
large weakly supervised data. In European Conference on Computer Vision, pages 67–84.
Springer, 2016.
[52] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human
shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7122–7131, 2018.
[53] J. E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the
society for Industrial and Applied Mathematics, 8(4):703–712, 1960.
[54] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[55] A. Kirsch, J. van Amersfoort, and Y . Gal. Batchbald: Efficient and diverse batch acqui-
sition for deep bayesian active learning. In Advances in Neural Information Processing
Systems, pages 7024–7035, 2019.
[56] M. Kocabas, N. Athanasiou, and M. J. Black. Vibe: Video inference for human body pose
and shape estimation. arXiv preprint arXiv:1912.05656, 2019.
[57] A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby. Big
transfer (bit): General visual representation learning. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16,
pages 491–507. Springer, 2020.
[58] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis. Learning to reconstruct
3d human pose and shape via model-fitting in the loop. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2252–2261, 2019.
[59] S. Kreiss, L. Bertoni, and A. Alahi. Pifpaf: Composite fields for human pose estimation.
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[60] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images.
Technical report, Citeseer, 2009.
[61] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. arXiv preprint
arXiv:1610.02242, 2016.
[62] T. Larsson. Fast and tight fitting bounding spheres. In SIGRAD 2008. The Annual
SIGRAD Conference Special Theme: Interaction; November 27-28; 2008 Stockholm;
Sweden, number 034, pages 27–30. Linköping University Electronic Press, 2008.
[63] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V . Gehler. Unite the people:
Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 6050–6059, 2017.
[64] Y. LeCun, C. Cortes, and C. Burges. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[65] A. Li, A. Jabri, A. Joulin, and L. van der Maaten. Learning visual n-grams from web
data. In Proceedings of the IEEE International Conference on Computer Vision, pages
4183–4192, 2017.
[66] X. Li and Y . Guo. Adaptive active learning for image classification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 859–866, 2013.
[67] Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu, and W. T. Freeman. Learning
the depths of moving people by watching frozen people. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4521–4530, 2019.
[68] M. Liao, G. Pang, J. Huang, T. Hassner, and X. Bai. Mask textspotter v3: Segmentation
proposal network for robust scene text spotting. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages
706–722. Springer, 2020.
[69] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zit-
nick. Microsoft coco: Common objects in context. In European conference on computer
vision, pages 740–755. Springer, 2014.
[70] Y . Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang. Abcnet: Real-time scene text
spotting with adaptive bezier-curve network. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 9809–9818, 2020.
[71] S. Lloyd. Least squares quantization in pcm. IEEE transactions on information theory,
28(2):129–137, 1982.
[72] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned
multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
[73] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for
visual question answering. Advances in neural information processing systems, 29:289–
297, 2016.
[74] Y . Luo, J. Zhu, M. Li, Y . Ren, and B. Zhang. Smooth neighbors on teacher graphs for
semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 8896–8905, 2018.
[75] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai. Mask textspotter: An end-to-end trainable
neural network for spotting text with arbitrary shapes. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 67–83, 2018.
[76] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning
research, 9(Nov):2579–2605, 2008.
[77] M. J. Marin-Jimenez, F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer.
3d human pose estimation from depth maps using a deep combination of poses. Journal
of Visual Communication and Image Representation, 55:627–639, 2018.
[78] H. M. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
[79] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for
3d human pose estimation. In Proceedings of the IEEE International Conference on
Computer Vision, pages 2640–2649, 2017.
[80] M. Mathew, D. Karatzas, and C. Jawahar. Docvqa: A dataset for vqa on document images.
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,
pages 2200–2209, 2021.
[81] P. Melville and R. J. Mooney. Diverse ensembles for active learning. In Proceedings of
the twenty-first international conference on Machine learning, page 74. ACM, 2004.
[82] Y. Mori, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing
and vector quantizing images with words. In First international workshop on multimedia
intelligent storage and retrieval management, pages 1–9. Citeseer, 1999.
[83] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular,
stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[84] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in
natural images with unsupervised feature learning. 2011.
[85] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In Proceedings of
the twenty-first international conference on Machine learning, page 79. ACM, 2004.
[86] G. Ning, Z. Zhang, and Z. He. Knowledge-guided deep fractal neural networks for human
pose estimation. IEEE Transactions on Multimedia, 20(5):1246–1259, 2017.
[87] T. Ohashi, Y. Ikegami, and Y. Nakamura. Synergetic reconstruction from 2d pose and
3d motion for wide-space multi-person video motion capture in the wild. arXiv preprint
arXiv:2001.05613, 2020.
[88] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy.
Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4903–4911, 2017.
[89] S. Patra and L. Bruzzone. A batch-mode active learning technique based on multiple
uncertainty for SVM classifier. IEEE Geoscience and Remote Sensing Letters, 9(3):497–
501, 2012.
[90] A.-I. Popa, M. Zanfir, and C. Sminchisescu. Deep multitask architecture for integrated 2d
and 3d human sensing. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6289–6298, 2017.
[91] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao. Towards unconstrained end-to-end
text spotting. In Proceedings of the IEEE/CVF International Conference on Computer
Vision, pages 4704–4714, 2019.
[92] A. Quattoni, M. Collins, and T. Darrell. Learning visual representations using images
with captions. In 2007 IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–8. IEEE, 2007.
[93] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language
supervision. arXiv preprint arXiv:2103.00020, 2021.
[94] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learn-
ing with ladder networks. In Advances in neural information processing systems, pages
3546–3554, 2015.
[95] J. Ritter. An efficient bounding sphere. Graphics gems, 1:301–303, 1990.
[96] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa-
thy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual
Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–
252, 2015.
[97] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113,
2016.
[98] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys. Pixelwise view selection
for unstructured multi-view stereo. In European Conference on Computer Vision, pages
501–518. Springer, 2016.
[99] S. Sedai, M. Bennamoun, and D. Q. Huynh. A Gaussian process guided particle filter for
tracking 3d human pose in video. IEEE Transactions on Image Processing, 22(11):4286–
4300, 2013.
[100] O. Sener and S. Savarese. Active learning for convolutional neural networks: A core-set
approach. 2018.
[101] C. E. Shannon. A mathematical theory of communication. Bell system technical journal,
27(3):379–423, 1948.
[102] Y. Shen, Y. Song, H. Li, S. Kamali, B. Wang, and C.-C. J. Kuo. K-covers for active
learning in image classification. In 2019 IEEE International Conference on Multimedia
& Expo Workshops (ICMEW), pages 288–293. IEEE, 2019.
[103] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[104] A. Singh, G. Pang, M. Toh, J. Huang, W. Galuba, and T. Hassner. TextOCR: Towards
large-scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8802–8812,
2021.
[105] S. Sinha, S. Ebrahimi, and T. Darrell. Variational adversarial active learning. arXiv
preprint arXiv:1904.00370, 2019.
[106] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a
simple way to prevent neural networks from overfitting. The Journal of Machine Learning
Research, 15(1):1929–1958, 2014.
[107] N. Srivastava, R. Salakhutdinov, et al. Multimodal learning with deep Boltzmann
machines. In NIPS, volume 1, page 2. Citeseer, 2012.
[108] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei. Integral human pose regression. In
Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545,
2018.
[109] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and
L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM,
59(2):64–73, 2016.
[110] S. Tong and D. Koller. Support vector machine active learning with applications to text
classification. Journal of machine learning research, 2(Nov):45–66, 2001.
[111] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
1653–1660, 2014.
[112] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. COCO-Text: Dataset
and benchmark for text detection and recognition in natural images. arXiv preprint
arXiv:1601.07140, 2016.
[113] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll. Recovering
accurate 3d human pose in the wild using IMUs and a moving camera. In Proceedings of
the European Conference on Computer Vision (ECCV), pages 601–617, 2018.
[114] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis. Deep learning for
computer vision: A brief review. Computational intelligence and neuroscience, 2018,
2018.
[115] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-
200-2011 Dataset. Technical report, 2011.
[116] H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, and W. Liu. All
you need is boundary: Toward arbitrary-shaped text spotting. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 12160–12167, 2020.
[117] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep
image classification. IEEE Transactions on Circuits and Systems for Video Technology,
27(12):2591–2600, 2017.
[118] G. Wei, C. Lan, W. Zeng, and Z. Chen. View invariant 3d human pose estimation. IEEE
Transactions on Circuits and Systems for Video Technology, 2019.
[119] Y. Xia, X. Cao, F. Wen, and J. Sun. Well begun is half done: Generating high-quality
seeds for automatic image dataset construction from web. In European Conference on
Computer Vision, pages 387–400. Springer, 2014.
[120] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking.
In Proceedings of the European conference on computer vision (ECCV), pages 466–481,
2018.
[121] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled
data for image classification. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2691–2699, 2015.
[122] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves
ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 10687–10698, 2020.
[123] X. Xu, Q. Zou, and X. Lin. Multi-person pose estimation with enhanced feature aggrega-
tion and selection. arXiv preprint arXiv:2003.10238, 2020.
[124] J. Yang, L. Wan, W. Xu, and S. Wang. 3d human pose estimation from a single image
via exemplar augmentation. Journal of Visual Communication and Image Representation,
59:371–379, 2019.
[125] L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen. Suggestive annotation: A deep
active learning framework for biomedical image segmentation. In International Confer-
ence on Medical Image Computing and Computer-Assisted Intervention, pages 399–407.
Springer, 2017.
[126] D. Yoo and I. S. Kweon. Learning loss for active learning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 93–102, 2019.
[127] J. Yu, J. Sun, Z. Song, S. Zheng, and B. Wei. Monocular three-dimensional human pose
estimation using local-topology preserved sparse retrieval. Journal of Electronic Imaging,
26(3):033008, 2017.
[128] W. Zhang, D. Kong, S. Wang, and Z. Wang. 3d human pose estimation from range images
with depth difference and geodesic distance. Journal of Visual Communication and Image
Representation, 59:272–282, 2019.
[129] W. Zhang, L. Shang, and A. B. Chan. A robust likelihood function for 3d human pose
tracking. IEEE Transactions on Image Processing, 23(12):5374–5389, 2014.
[130] L. Zhou, J. Liu, Y. Cheng, Z. Gan, and L. Zhang. CUPID: Adaptive curation of pre-training
data for video-and-language representation learning. arXiv preprint arXiv:2104.00285,
2021.
[131] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning
view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.
[132] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using
cycle-consistent adversarial networks. In Proceedings of the IEEE international confer-
ence on computer vision, pages 2223–2232, 2017.
[133] X. J. Zhu. Semi-supervised learning literature survey. Technical report, University of
Wisconsin-Madison Department of Computer Sciences, 2005.
Asset Metadata
Creator
Shen, Yeji (author)
Core Title
Labeling cost reduction techniques for deep learning: methodologies and applications
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2021-12
Publication Date
10/14/2021
Defense Date
09/07/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
3D human pose estimation, active learning, deep learning, OAI-PMH Harvest, representation learning
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Nakano, Aiichiro (committee member), Ortega, Antonio (committee member)
Creator Email
morningsyj@gmail.com,yejishen@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC16208271
Unique identifier
UC16208271
Legacy Identifier
etd-ShenYeji-10162
Document Type
Dissertation
Rights
Shen, Yeji
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu